2.4 Phylogenetic reconstruction [edit]

image.png

In this chapter we'll begin to explore the goals, approaches, and challenges for creating phylogenetic trees, or phylogenies. Phylogenies, such as the two presented in Figure 1, represent hypotheses about the evolutionary history of a groups, who are represented by the tips in the tree. You can explore an interactive version of the three-domain tree presented in Figure 1b online, through the Interactive Tree of Life project.

Figure 1a: Evolutionary tree presented by Charles Darwin in On the Origin of Species. Figure 1b: A hypothesis of evolutionary relationships between the three domains of life. This image was created by the Norman Pace Laboratory.

2.4.1 Why build phylogenies? [edit]

Reconstructing the phylogeny of a group of individuals is useful for many reasons. Probably the most obvious of these is understanding the evolutionary relationship between a group of organisms. For example, over the past half-century we've gained great insight into the evolution of our species, Homo sapiens, by studying features of our closest relatives, both extant (still existing) organisms, such as the Pan (chimpanzees) and Gorilla genera, and extinct (no longer living) species, including Homo neanderthalensis, Homo erectus, Homo habilus, and many species in the Australopithecus genus. (The Smithsonian Museum's Human Origins Initiative is an excellent resource for learning more about the fascinating subject of human evolution.) image.png

image.png

image.png

image.png

image.png

image.png

We've also improved our understanding of the deeper branches in the tree of life.

For example, Woese and Fox (1977) used phylogenetic reconstruction to first illustrate that the "prokaryotes" really represented two ancient lineages which they called the eubacteria and the archaebacteria, which ultimately led to the proposal of a "three domain" tree of life composed of the three deep branching linages, the archaea, the bacteria, and the eucarya (Woese, Kandler and Wheelis (1990)).

Phylogenetic trees such as these are also useful for understanding evolution itself.

In fact, they're so useful that the single image that Charles Darwin found important enough to include in On the Origin of Species was the phylogenetic tree presented in Figure 1a.

The individuals represented at the tips of our trees don't necessarily have to be organisms though. In another important application of phylogenetic trees, we can study the evolution of genes, which can help us gain a deeper understanding of gene function. Specifically, we can learn about families of related genes.

A classic example of this is the globin family, which includes the proteins hemoglobin and myoglobin, molecules that can reversibly bind oxygen (meaning they can bind to it, and then let go of it). You've probably heard of hemoglobin (if not globins in general), as this molecule binds to oxygen where it is present in high concentration (such as in your lung) and releases it where it is present in low concentration (such as in the bicep, where it is ultimately used to power your arm). Hemoglobin and myoglobin are paralogs, meaning that they are related by a gene duplication and subsequent divergence. If you were to compare an unknown globin sequence to either of these you could detect homology, but a tree such as the one present in Figure 2, would help you understand the type of homologous relationship (i.e., whether it was orthology or paralogy).

Figure 2: A tree representing members of the globin gene family from diverse taxa. This image is an unmodified reproduction of Figure 5 from A phylogenomic profile of globins by Vinogradov et al (2006).

Phylogenetic trees are used for many other diverse applications in bioinformatics, so it's therefore important that a bioinformatician have an understanding of they are built and how they should be interpreted.

An additional application that we'll cover in this text is comparing the composition of communities of organisms, but we'll come back to that later.

2.4.2 How phylogenies are reconstructed [edit]

Phylogenies are reconstructed using a variety of different algorithms, some of which we'll cover in this chapter. These algorithms all work by comparing a set of features of organisms, and inferring the evolutionary distance between those organisms based on the similarity of their features. The features that are compared can be nearly anything that is observable, either from extant organisms or fossilized representatives of extinct organisms.

As an example, let's consider the reconstruction of the phylogeny of spiders (the order Araneae), a hypothesis of which is presented in Figure 3. Of the extant spiders, some are orb-weavers (meaning they spin circular, sticky webs), and others are not. Entomologists have debated whether orb-weaving is a monophyletic trait (meaning that it evolved one time), or whether it is polyphyletic (meaning that it evolved multiple times, such as flight, which has evolved independently in birds, flying dinosaurs, insects, and mammals). If orb-weaving is monophyletic, it would mean that over the course of evolution, extant spiders which don't weave orb webs have lost that ability. Some researchers doubt this as it's a very effective means of catching prey, and losing that ability would likely constitute an evolutionary disadvantage. If orb-weaving is polyphyletic, it would means that in at least two different spider lineages, this trait arose independently, which other researchers consider to be very unlikely due to the complexity of engineering these webs. Examples of the evolution of monophyletic and polyphyletic traits are presented in Figures 3 and 4, respectively.

Figure 3: Example phylogeny illustrating a monophyletic trait shared by a group of organisms. In a monophyletic group, the last common ancestor was also member of the group (e.g., multicellular organisms).
Figure 4: Example phylogeny illustrating a polyphyletic trait shared by a group of organisms. In a polyphyletic group the last common ancestor was not a member of the group (e.g., flying animals).

Kahoot ------------

Earlier work on understanding the relations between the spider lineages focused on comparing traits that entomologists would observe, for example by watching spiders in action or by dissecting them. For example, in 1986 through 1991, Johnathan Coddington published several studies that tabulated and compared about sixty features of 32 spider taxa (Coddington J. 1986. The monophyletic origin of the orb web. In: Shear W, ed. Spiders: webs, behavior, and evolution. Stanford, California: Stanford University Press. 319-363; Coddington JA. 1991. Cladistics and spider classification: araneomorph phylogeny and the monophyly of orbweavers (Araneae: Araneomorphae; Orbiculariae) Acta Zoologica Fennica 190:75-87). Features included whether a spider wraps its prey when it attacks (a behavioral trait), and how branched the spider's trachea is (a morphological trait). By determining which spiders were more similar and different across these traits, Dr. Coddington provided early evidence for the hypothesis that orb-weaving is an ancient monophyletic trait.

More recently, several research teams have used features of spider genomes to reconstruct the spider phylogeny (Bond et al., 2014, Garrison et al., 2016).

  • Using this approach, the features become the nucleotides observed at particular positions in the genome, which are observed first by sequencing specific genes that the researchers target that are present in all members of the group, and then aligning those sequences with multiple sequence alignment.

This has several advantages over feature matrices derived from morphological and behavioral traits, including that many more features can be observed.

For example, (Garrison et al., 2016), compared approximately 700,000 amino acid positions from nearly 4000 loci around the genomes of 70 spider taxa. Compare the number of features here to the number mentioned in the previous paragraph. These phylogenomic studies have further supported the idea that orb-weaving is an ancient monophyletic trait, and have provided much finer scale information on the evolution of spiders. Supported by these data, researchers hypothesize that the loss of orb-weaving might not be that surprising. While it does provide an effective means of catching flying insects, many insects which are potential prey for spiders don't fly. Further, orb webs may attract predators of spiders, as they are easily observable signals of where a spider can be found.

Figure 5: A spider phylogeny. Numbers at internal nodes correspond to the taxonomic groups described in Table 1 of Garrison et al., 2016. This image is an unmodified version of Figure 1 of Garrison et al., 2016.

For the remainder of this chapter, we'll consider methods for phylogenetic reconstruction that use genome sequence data as features.

2.4.3 Some terminology [edit]

Next, let's cover a few terms using the tree diagram in Figure 6.

Figure 6: A schematic of a phylogenetic tree illustrating important terms.

Terminal nodes or tips typically represent extant organisms, also frequently called operational taxonomic units or OTUs.

  • OTU is a generic way of referring to a grouping of organisms (such as a species, a genus, or a phylum), without specifically identifying what that grouping is.

Internal nodes in a phylogenetic tree represent hypothetical ancestors.

  • We postulate their existence but often don't have direct evidence.
  • The root node is the internal node from which all other nodes in the tree descend. This is often referred to as the last common ancestor (LCA) of the OTUs represented in the tree. In a universal tree of life, the LCA is often referred to as LUCA, the last universal common ancestor. All nodes in the tree can be referred to as OTUs.

Branches connect the nodes in the tree, and generally represent time or some amount of evolutionary change between the OTUs.

  • The specific meaning of the branches will depend on the method that was used to build the phylogenetic tree.

A clade in a tree refers to some node (either internal or terminal) and all nodes descending from it (i.e., moving away from the root toward the tips).

We'll use all of these terms below as we begin to explore phylogenetic trees.

2.4.4 Simulating evolution [edit]

Before we jump into how to reconstruct a phylogeny from DNA sequence data, we're going to perform a simulation of the process of evolution of a DNA sequence. In this simulation, we're going to model sequence evolution with a Python function, and then we're going to run that function to simulate multiple generations of evolution.

Bioinformatics developers often use simulations to understand how their algorithms work, as they uniquely provide an opportunity to know what the correct answer is. This provides a way to compare algorithms to each other to figure out which performs best under which circumstances. In our simulation we're going to have control over the starting sequence, and the probability of incurring a substitution mutation or an insertion/deletion mutation at each position of the sequence in each generation.

Our simulation will works as follows.

  • We'll have one function that we primarily interact with called evolve_generations.
    • This will take a starting sequence and the number of generations that we want to simulate.
    • It will also take the probability that we want a substitution mutation to occur at each position in the starting sequence, and the probability that we want either an insertion or deletion mutation to occur at each position.
  • In each generation, every sequence will spawn two new sequences, randomly incurring mutations at the prescribed rates.
    • This effectively simulates a clonal process of reproduction, a form of asexual reproduction common in single cellular organisms, where a parent cell divides into two cells, each containing a copy of the parent's genome with some errors introduced.

Let's inspect this code and then run our simulation beginning with a random sequence.

In [6]:
%pylab inline

import numpy as np
import seaborn as sns
import random

#from IPython.core import page
#page.page = print
Populating the interactive namespace from numpy and matplotlib

First we'll look at the function used to simulate the evolution of a single sequence. This is where most of the important evolutionary modeling of sequence evolution happens.

In [7]:
from iab.algorithms import evolve_sequence
%psource evolve_sequence

Next, take a look at the function that models a single generation of a single sequence. This is where the clonal reproduction (i.e., one parent sequence becoming two child sequences) occurs.

In [8]:
from iab.algorithms import evolve_generation
%psource evolve_generation

Finally, take a look at our entry point function. This is where we provide the parameters of our simulation, including the starting sequence, the number of generations, and the mutation probabilities. Notice how each of these functions builds on the prior functions.

In [9]:
from iab.algorithms import evolve_generations
%psource evolve_generations

Now we'll run our simulation. We'll start with a random DNA sequence, and then evolve three generations.

Before running this, can you predict how many child sequences we'll end up with after three generations?

When we call evolve_generations, we'll pass the parameter verbose=True. This will tell the function to print out some information throughout the process. This will let us inspect our evolutionary process: something that is impossible to do with real sequences from history, though we can observe microbial evolution in the lab.

In [10]:
from iab.algorithms import random_sequence
import skbio
sequence = random_sequence(skbio.DNA, 50)
In [11]:
sequences = evolve_generations(sequence, generations=3, 
                               substitution_probability=0.1, 
                               indel_probability=0.05,
                               increased_rate_probability=0.1, 
                               verbose=True)
Generation: 0 (Number of parent sequences: 1)
ACTCTTTAAGTATGGGCTGTCTCCGATGTAGGTCTCTGCGTTCTTAGATG (last common ancestor)

ACTCTTTAAGTATGGGCTGTCTCCGATGTAGGTCTCTGCGTTCTTAGATG (parent id: 0)
ACTCTTTAAGTATGTGTTGTCTCCGATGTAGGTCTCTGCCCGATTCTTAGATG (child 1 id: 0.1)
ACTCTTTACTGTCTCGGAGGTAGGTCTCTGCGTTCTTATAACG (child 2 id: 0.2)

Generation: 1 (Number of parent sequences: 2)
ACTCTTTAAGTATGGGCTGTCTCCGATGTAGGTCTCTGCGTTCTTAGATG (last common ancestor)

ACTCTTTAAGTATGTGTTGTCTCCGATGTAGGTCTCTGCCCGATTCTTAGATG (parent id: 0.1)
ACTCTTTAAGTATGCTCCAATGTAGGTCTCTGCCCGTGATTTCTGGAGA (child 1 id: 0.1.1)
ACCAGCTTTAAGTATGAGTCTCCTATGTAGTCTCTGCCCGAAACTTAAATGCATG (child 2 id: 0.1.2)

ACTCTTTACTGTCTCGGAGGTAGGTCTCTGCGTTCTTATAACG (parent id: 0.2)
ACACGGTCTCGGAGGTGGGTCTCTGCCTTCTTATTCAGG (child 1 id: 0.2.1)
ACTCTTTACTGTCATCTGGAGGTACATCTCTGCATGCTTATAACG (child 2 id: 0.2.2)

Generation: 2 (Number of parent sequences: 4)
ACTCTTTAAGTATGGGCTGTCTCCGATGTAGGTCTCTGCGTTCTTAGATG (last common ancestor)

ACTCTTTAAGTATGCTCCAATGTAGGTCTCTGCCCGTGATTTCTGGAGA (parent id: 0.1.1)
ACTCGGACATTAAGTATGCTCCAATGTAGGACTCCGCCCGTAATTTCTGTAGA (child 1 id: 0.1.1.1)
ACTCTTTGAGTAAGCTTTAATGTAGGTCTTTGGCCGTGAATACTGGAGA (child 2 id: 0.1.1.2)

ACCAGCTTTAAGTATGAGTCTCCTATGTAGTCTCTGCCCGAAACTTAAATGCATG (parent id: 0.1.2)
TCCAGCTTTAAGAATGAGTCTCCTATGTAATGTCCCGAAACTTAAACCCGCCATGG (child 1 id: 0.1.2.1)
ACCAGCCTTAAGTACGAGTCTCATATGAGTAAAGTCTTTGCCCTAATTAAATGAATG (child 2 id: 0.1.2.2)

ACACGGTCTCGGAGGTGGGTCTCTGCCTTCTTATTCAGG (parent id: 0.2.1)
ACCCCGGAGGTGGGTCTCTGCCTTCTTATTTGTCAG (child 1 id: 0.2.1.1)
AGTTGCACGGCAGGGAGGTGGGTCTGTGCCTTCAGCAATTGTTCAGG (child 2 id: 0.2.1.2)

ACTCTTTACTGTCATCTGGAGGTACATCTCTGCATGCTTATAACG (parent id: 0.2.2)
ACTCTTTACTGTCATGGACTGGAGGTCCATCTCGGCATGCTTATAACG (child 1 id: 0.2.2.1)
ACTCTTTAGTGTCATCTGGAGAGGTACATCTAGATAGCATGCCTATAG (child 2 id: 0.2.2.2)

We now have a new variable, sequences, which contains the child sequences from the last generation.

Take a minute to look at the ids of the parent and child sequences above, and the ids of a couple of the final generation sequences.

  • These ids are constructed so that each sequence contains the identifiers of its ancestral sequences, and then either 1 or 2.
  • Notice that all sequence identifiers start with 0, the identifier of the last common ancestor (or our starting sequence) of all of the sequences.

These identifiers will help us interpret whether the phylogenies that we reconstruct accurately represent the evolutionary relationships between the sequences.

Also, notice that at this point we only have the sequences from the last generation. We no longer have the ancestral sequences (which would correspond to the internal nodes in the tree). This models the real world, were we only have sequences from extant organisms, but not their ancestors.

Take a minute to compare the two sequences below. What types of mutations happened over the course of their evolution?

In [12]:
print(len(sequences))
8
In [13]:
sequences[0]
Out[13]:
DNA
------------------------------------------------------------
Metadata:
    'id': '0.1.1.1'
Stats:
    length: 53
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 45.28%
------------------------------------------------------------
0 ACTCGGACAT TAAGTATGCT CCAATGTAGG ACTCCGCCCG TAATTTCTGT AGA
In [14]:
sequences[-1]
Out[14]:
DNA
------------------------------------------------------
Metadata:
    'id': '0.2.2.2'
Stats:
    length: 48
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 41.67%
------------------------------------------------------
0 ACTCTTTAGT GTCATCTGGA GAGGTACATC TAGATAGCAT GCCTATAG

In our simulation, each sequence is directly derived from exactly one sequence from the previous generation.

  • The evolution of all of the sequences traces back to starting sequence that we provided.
  • This means that our final sequences are all homologous.

And because we have modeled this process, we know where each sequence fits in relation to all of the other sequences in the phylogeny.

  • Our goal with the algorithms we'll study for the rest of this chapter is to reconstruct that phylogeny given only the last generation of sequences.
  • We'll use the fact that we know the true phylogeny to help us evaluate the relative performance of the different methods.
Figure 7: Schematic of a simulated evolutionary process. Bases in red indicate mutation since the last common ancestor. The bottom panel illustrates the real-world equivalent of our final product, where we wouldn't know the true phylogeny (indicated by the dashed branches), the sequence of the last common ancestor, or what positions have changed since the last common ancestor.

Let's simulate 10 generations of sequences here, and then randomly select some of those sequences to work with in the remaining sections of this chapter. For the sake of runtime, I'm going to set our indel_probability=0.0. This means that we won't need to align our selected sequences before constructing trees from them (because with only substitution mutations occurring, homologous positions always remain aligned). Multiple sequence alignment is currently a runtime bottleneck in IAB, so this will let us run this notebook much faster. If you'd like to model insertion/deletions, you can increase the indel_probability, say to 0.005. If you do that, the sequences will be aligned for you in the next cell, but it may take around 30 minutes to run.

In [15]:
from skbio.alignment import global_pairwise_align_nucleotide
from iab.algorithms import progressive_msa
from functools import partial
In [16]:
indel_probability = 0.0

sequences = evolve_generations(sequence, 
                               generations=10, 
                               substitution_probability=0.03,
                               indel_probability=0.0, 
                               increased_rate_probability=0.1, 
                               verbose=False)

sequences = random.sample(sequences, 25)
if indel_probability == 0:
    sequences_aligned = sequences
else:
    gpa = partial(global_pairwise_align_nucleotide, 
                  penalize_terminal_gaps=True)
    
    sequences_aligned = progressive_msa(sequences, 
                                        pairwise_aligner=gpa)

2.4.4.1 A cautionary word about simulations [edit]

While simulations are extremely powerful for comparing algorithms, they can also be misleading. This is in part because when we model evolution we simplify the evolutionary process, and the model is stochastic. One assumption that our simulation is making is that "bursts" of evolution (i.e., where our substitution rate temporarily increases) are restricted to only a single generation. A child is no more or less like to have an increased rate of substitutions if its parent did. This may or may not be an accurate model. Imagine in the real-world that the environment changed drastically for some descendants (for example, if a geological event created new thermal vents in a lake they inhabited, resulting in an increase in mean water temperature), but not for others. The lineages who experience the environmental change might have an increased rate of substitutions as their genomes adapt to the new environment. The increased substitution rate may persist for multiple generations, or it may not. This is one limitation of our simulation that we know about, but because we don't have a perfect understanding of sequence evolution, there are limitations that we don't know about.

When using simulations, it's important to understand what assumptions the simulation makes so you know what it can tell you about and what it can't tell you about. You'll want to consider that when determining how confident you are in the results of an evaluation based on simulation.

Why question: What are some other assumptions that are being made by the evolutionary simulation presented here? There are many, so take a minute to list a few.

On the opposite end of the spectrum from simulations for algorithm comparison is comparisons based on real data, where we don't know groud-truth.

  • The trade-off is that, while there are no assumptions being made in real data with respect to the composition of the sequences, with real data we don't usallly know what the right answer (in our case, the correct phylogeny) is (unless we record it in the a microbiology lab), so it's harder to determine which algorithms are doing better or worse.

The take-away message here is that neither approach is perfect, and often researchers will use a combination of simulated and real data to evaluate algorithms.

2.4.5 Visualizing trees with ete3 [edit]

As we now start computing phylogenetic trees, we're going to need a way to visualize them. We'll use the ete3 Python package for this, and in the next cell we'll configure the TreeStyle which is used to define how the trees we visualize will look. If you'd like to experiment with other views, you can modify the code in this cell according to the ete3 documentation. If you come up with a nicer style, I'd be interested in seeing that in a pull request. You can post screenshots to IAB issue #213 before creating a pull request so I can see what the new style looks like.

In [17]:
import ete3
help(ete3.Tree)
Help on class TreeNode in module ete3.coretype.tree:

class TreeNode(builtins.object)
 |  TreeNode (Tree) class is used to store a tree structure. A tree
 |  consists of a collection of TreeNode instances connected in a
 |  hierarchical way. Trees can be loaded from the New Hampshire Newick
 |  format (newick).
 |  
 |  :argument newick: Path to the file containing the tree or, alternatively,
 |     the text string containing the same information.
 |  
 |  :argument 0 format: subnewick format
 |  
 |    .. table::
 |  
 |        ======  ==============================================
 |        FORMAT  DESCRIPTION
 |        ======  ==============================================
 |        0        flexible with support values
 |        1        flexible with internal node names
 |        2        all branches + leaf names + internal supports
 |        3        all branches + all names
 |        4        leaf branches + leaf names
 |        5        internal and leaf branches + leaf names
 |        6        internal branches + leaf names
 |        7        leaf branches + all names
 |        8        all names
 |        9        leaf names
 |        100      topology only
 |        ======  ==============================================
 |  
 |  :returns: a tree node object which represents the base of the tree.
 |  
 |  **Examples:**
 |  
 |  ::
 |  
 |      t1 = Tree() # creates an empty tree
 |      t2 = Tree('(A:1,(B:1,(C:1,D:1):0.5):0.5);')
 |      t3 = Tree('/home/user/myNewickFile.txt')
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value)
 |      This allows to sum two trees.
 |  
 |  __and__(self, value)
 |      This allows to execute tree&'A' to obtain the descendant node
 |      whose name is A
 |  
 |  __bool__(self)
 |      Python3's equivalent of __nonzero__
 |      If this is not defined bool(class_instance) will call
 |      __len__ in python3
 |  
 |  __contains__(self, item)
 |      Check if item belongs to this node. The 'item' argument must
 |      be a node instance or its associated name.
 |  
 |  __init__(self, newick=None, format=0, dist=None, support=None, name=None, quoted_node_names=False)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Iterator over leaf nodes
 |  
 |  __len__(self)
 |      Node len returns number of children.
 |  
 |  __nonzero__(self)
 |  
 |  __repr__(self)
 |      Return repr(self).
 |  
 |  __str__(self)
 |      Print tree in newick format.
 |  
 |  add_child(self, child=None, name=None, dist=None, support=None)
 |      Adds a new child to this node. If child node is not suplied
 |      as an argument, a new node instance will be created.
 |      
 |      :argument None child: the node instance to be added as a child.
 |      :argument None name: the name that will be given to the child.
 |      :argument None dist: the distance from the node to the child.
 |      :argument None support: the support value of child partition.
 |      
 |      :returns: The child node instance
 |  
 |  add_face(self, face, column, position='branch-right')
 |      .. versionadded: 2.1
 |      
 |      Add a fixed face to the node.  This type of faces will be
 |      always attached to nodes, independently of the layout
 |      function.
 |      
 |      :argument face: a Face or inherited instance
 |      :argument column: An integer number starting from 0
 |      :argument "branch-right" position: Posible values are:
 |        "branch-right", "branch-top", "branch-bottom", "float",
 |        "aligned"
 |  
 |  add_feature(self, pr_name, pr_value)
 |      Add or update a node's feature.
 |  
 |  add_features(self, **features)
 |      Add or update several features.
 |  
 |  add_sister(self, sister=None, name=None, dist=None)
 |      Adds a sister to this node. If sister node is not supplied
 |      as an argument, a new TreeNode instance will be created and
 |      returned.
 |  
 |  check_monophyly(self, values, target_attr, ignore_missing=False, unrooted=False)
 |      .. versionadded: 2.2
 |      
 |      Returns True if a given target attribute is monophyletic under
 |      this node for the provided set of values.
 |      
 |      If not all values are represented in the current tree
 |      structure, a ValueError exception will be raised to warn that
 |      strict monophyly could never be reached (this behaviour can be
 |      avoided by enabling the `ignore_missing` flag.
 |      
 |      :param values: a set of values for which monophyly is
 |          expected.
 |      
 |      :param target_attr: node attribute being used to check
 |          monophyly (i.e. species for species trees, names for gene
 |          family trees, or any custom feature present in the tree).
 |      
 |      :param False ignore_missing: Avoid raising an Exception when
 |          missing attributes are found.
 |      
 |      
 |      .. versionchanged: 2.3
 |      
 |      :param False unrooted: If True, tree will be treated as unrooted, thus
 |        allowing to find monophyly even when current outgroup is splitting a
 |        monophyletic group.
 |      
 |      :returns: the following tuple
 |                IsMonophyletic (boolean),
 |                clade type ('monophyletic', 'paraphyletic' or 'polyphyletic'),
 |                leaves breaking the monophyly (set)
 |  
 |  compare(self, ref_tree, use_collateral=False, min_support_source=0.0, min_support_ref=0.0, has_duplications=False, expand_polytomies=False, unrooted=False, max_treeko_splits_to_be_artifact=1000, ref_tree_attr='name', source_tree_attr='name')
 |      compare this tree with another using robinson foulds symmetric difference
 |      and number of shared edges. Trees of different sizes and with duplicated
 |      items allowed.
 |      
 |      returns: a Python dictionary with results
 |  
 |  convert_to_ultrametric(self, tree_length=None, strategy='balanced')
 |      .. versionadded: 2.1
 |      
 |      Converts a tree into ultrametric topology (all leaves must have
 |      the same distance to root). Note that, for visual inspection
 |      of ultrametric trees, node.img_style["size"] should be set to
 |      0.
 |  
 |  copy(self, method='cpickle')
 |      .. versionadded: 2.1
 |      
 |      Returns a copy of the current node.
 |      
 |      :var cpickle method: Protocol used to copy the node
 |      structure. The following values are accepted:
 |      
 |         - "newick": Tree topology, node names, branch lengths and
 |           branch support values will be copied by as represented in
 |           the newick string (copy by newick string serialisation).
 |      
 |         - "newick-extended": Tree topology and all node features
 |           will be copied based on the extended newick format
 |           representation. Only node features will be copied, thus
 |           excluding other node attributes. As this method is also
 |           based on newick serialisation, features will be converted
 |           into text strings when making the copy.
 |      
 |         - "cpickle": The whole node structure and its content is
 |           cloned based on cPickle object serialisation (slower, but
 |           recommended for full tree copying)
 |      
 |         - "deepcopy": The whole node structure and its content is
 |           copied based on the standard "copy" Python functionality
 |           (this is the slowest method but it allows to copy complex
 |           objects even if attributes point to lambda functions,
 |           etc.)
 |  
 |  del_feature(self, pr_name)
 |      Permanently deletes a node's feature.
 |  
 |  delete(self, prevent_nondicotomic=True, preserve_branch_length=False)
 |      Deletes node from the tree structure. Notice that this method
 |      makes 'disappear' the node from the tree structure. This means
 |      that children from the deleted node are transferred to the
 |      next available parent.
 |      
 |      :param True prevent_nondicotomic: When True (default), delete
 |          function will be execute recursively to prevent
 |          single-child nodes.
 |      
 |      :param False preserve_branch_length: If True, branch lengths
 |          of the deleted nodes are transferred (summed up) to its
 |          parent's branch, thus keeping original distances among
 |          nodes.
 |      
 |      **Example:**
 |      
 |      ::
 |      
 |              / C
 |        root-|
 |             |        / B
 |              \--- H |
 |                      \ A
 |      
 |        > H.delete() will produce this structure:
 |      
 |              / C
 |             |
 |        root-|--B
 |             |
 |              \ A
 |  
 |  describe(self)
 |      Prints general information about this node and its
 |      connections.
 |  
 |  detach(self)
 |      Detachs this node (and all its descendants) from its parent
 |      and returns the referent to itself.
 |      
 |      Detached node conserves all its structure of descendants, and can
 |      be attached to another node through the 'add_child' function. This
 |      mechanism can be seen as a cut and paste.
 |  
 |  expand_polytomies(self, map_attr='name', polytomy_size_limit=5, skip_large_polytomies=False)
 |      .. versionadded:: 2.3
 |      
 |      Given a tree with one or more polytomies, this functions returns the
 |      list of all trees (in newick format) resulting from the combination of
 |      all possible solutions of the multifurcated nodes.
 |      
 |      .. warning:
 |      
 |         Please note that the number of of possible binary trees grows
 |         exponentially with the number and size of polytomies. Using this
 |         function with large multifurcations is not feasible:
 |      
 |         polytomy size: 3 number of binary trees: 3
 |         polytomy size: 4 number of binary trees: 15
 |         polytomy size: 5 number of binary trees: 105
 |         polytomy size: 6 number of binary trees: 945
 |         polytomy size: 7 number of binary trees: 10395
 |         polytomy size: 8 number of binary trees: 135135
 |         polytomy size: 9 number of binary trees: 2027025
 |      
 |      http://ajmonline.org/2010/darwin.php
 |  
 |  get_ancestors(self)
 |      versionadded: 2.2
 |      
 |      Returns the list of all ancestor nodes from current node to
 |      the current tree root.
 |  
 |  get_ascii(self, show_internal=True, compact=False, attributes=None)
 |      Returns a string containing an ascii drawing of the tree.
 |      
 |      :argument show_internal: includes internal edge names.
 |      :argument compact: use exactly one line per tip.
 |      
 |      :param attributes: A list of node attributes to shown in the
 |          ASCII representation.
 |  
 |  get_cached_content(self, store_attr=None, container_type=<class 'set'>, leaves_only=True, _store=None)
 |      .. versionadded: 2.2
 |      
 |      Returns a dictionary pointing to the preloaded content of each
 |      internal node under this tree. Such a dictionary is intended
 |      to work as a cache for operations that require many traversal
 |      operations.
 |      
 |      :param None store_attr: Specifies the node attribute that
 |          should be cached (i.e. name, distance, etc.). When none,
 |          the whole node instance is cached.
 |      
 |      :param _store: (internal use)
 |  
 |  get_children(self)
 |      Returns an independent list of node's children.
 |  
 |  get_closest_leaf(self, topology_only=False, is_leaf_fn=None)
 |      Returns node's closest descendant leaf and the distance to
 |      it.
 |      
 |      :argument False topology_only: If set to True, distance
 |        between nodes will be referred to the number of nodes
 |        between them. In other words, topological distance will be
 |        used instead of branch length distances.
 |      
 |      :return: A tuple containing the closest leaf referred to the
 |        current node and the distance to it.
 |  
 |  get_common_ancestor(self, *target_nodes, **kargs)
 |      Returns the first common ancestor between this node and a given
 |      list of 'target_nodes'.
 |      
 |      **Examples:**
 |      
 |      ::
 |      
 |        t = tree.Tree("(((A:0.1, B:0.01):0.001, C:0.0001):1.0[&&NHX:name=common], (D:0.00001):0.000001):2.0[&&NHX:name=root];")
 |        A = t.get_descendants_by_name("A")[0]
 |        C = t.get_descendants_by_name("C")[0]
 |        common =  A.get_common_ancestor(C)
 |        print common.name
 |  
 |  get_descendants(self, strategy='levelorder', is_leaf_fn=None)
 |      Returns a list of all (leaves and internal) descendant nodes.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  get_distance(self, target, target2=None, topology_only=False)
 |      Returns the distance between two nodes. If only one target is
 |      specified, it returns the distance between the target and the
 |      current node.
 |      
 |      :argument target: a node within the same tree structure.
 |      
 |      :argument target2: a node within the same tree structure. If
 |        not specified, current node is used as target2.
 |      
 |      :argument False topology_only: If set to True, distance will
 |        refer to the number of nodes between target and target2.
 |      
 |      :returns: branch length distance between target and
 |        target2. If topology_only flag is True, returns the number
 |        of nodes between target and target2.
 |  
 |  get_edges(self, cached_content=None)
 |      .. versionadded:: 2.3
 |      
 |      Returns the list of edges of a tree. Each edge is represented as a
 |      tuple of two elements, each containing the list of nodes separated by
 |      the edge.
 |  
 |  get_farthest_leaf(self, topology_only=False, is_leaf_fn=None)
 |      Returns node's farthest descendant node (which is always a leaf), and the
 |      distance to it.
 |      
 |      :argument False topology_only: If set to True, distance
 |        between nodes will be referred to the number of nodes
 |        between them. In other words, topological distance will be
 |        used instead of branch length distances.
 |      
 |      :return: A tuple containing the farthest leaf referred to the
 |        current node and the distance to it.
 |  
 |  get_farthest_node(self, topology_only=False)
 |      Returns the node's farthest descendant or ancestor node, and the
 |      distance to it.
 |      
 |      :argument False topology_only: If set to True, distance
 |        between nodes will be referred to the number of nodes
 |        between them. In other words, topological distance will be
 |        used instead of branch length distances.
 |      
 |      :return: A tuple containing the farthest node referred to the
 |        current node and the distance to it.
 |  
 |  get_leaf_names(self, is_leaf_fn=None)
 |      Returns the list of terminal node names under the current
 |      node.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  get_leaves(self, is_leaf_fn=None)
 |      Returns the list of terminal nodes (leaves) under this node.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  get_leaves_by_name(self, name)
 |      Returns a list of leaf nodes matching a given name.
 |  
 |  get_midpoint_outgroup(self)
 |      Returns the node that divides the current tree into two distance-balanced
 |      partitions.
 |  
 |  get_monophyletic(self, values, target_attr)
 |      .. versionadded:: 2.2
 |      
 |      Returns a list of nodes matching the provided monophyly
 |      criteria. For a node to be considered a match, all
 |      `target_attr` values within and node, and exclusively them,
 |      should be grouped.
 |      
 |      :param values: a set of values for which monophyly is
 |          expected.
 |      
 |      :param target_attr: node attribute being used to check
 |          monophyly (i.e. species for species trees, names for gene
 |          family trees).
 |  
 |  get_sisters(self)
 |      Returns an independent list of sister nodes.
 |  
 |  get_topology_id(self, attr='name')
 |      .. versionadded:: 2.3
 |      
 |      Returns the unique ID representing the topology of the current tree. Two
 |      trees with the same topology will produce the same id. If trees are
 |      unrooted, make sure that the root node is not binary or use the
 |      tree.unroot() function before generating the topology id.
 |      
 |      This is useful to detect the number of unique topologies over a bunch of
 |      trees, without requiring full distance methods.
 |      
 |      The id is, by default, calculated based on the terminal node's names. Any
 |      other node attribute could be used instead.
 |  
 |  get_tree_root(self)
 |      Returns the absolute root node of current tree structure.
 |  
 |  is_leaf(self)
 |      Return True if current node is a leaf.
 |  
 |  is_root(self)
 |      Returns True if current node has no parent
 |  
 |  iter_ancestors(self)
 |      versionadded: 2.2
 |      
 |      Iterates over the list of all ancestor nodes from current node
 |      to the current tree root.
 |  
 |  iter_descendants(self, strategy='levelorder', is_leaf_fn=None)
 |      Returns an iterator over all descendant nodes.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  iter_edges(self, cached_content=None)
 |      .. versionadded:: 2.3
 |      
 |      Iterate over the list of edges of a tree. Each edge is represented as a
 |      tuple of two elements, each containing the list of nodes separated by
 |      the edge.
 |  
 |  iter_leaf_names(self, is_leaf_fn=None)
 |      Returns an iterator over the leaf names under this node.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  iter_leaves(self, is_leaf_fn=None)
 |      Returns an iterator over the leaves under this node.
 |      
 |      :argument None is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |  
 |  iter_prepostorder(self, is_leaf_fn=None)
 |      Iterate over all nodes in a tree yielding every node in both
 |      pre and post order. Each iteration returns a postorder flag
 |      (True if node is being visited in postorder) and a node
 |      instance.
 |  
 |  iter_search_nodes(self, **conditions)
 |      Search nodes in an iterative way. Matches are yielded as they
 |      are being found. This avoids needing to scan the full tree
 |      topology before returning the first matches. Useful when
 |      dealing with huge trees.
 |  
 |  ladderize(self, direction=0)
 |      .. versionadded: 2.1
 |      
 |      Sort the branches of a given tree (swapping children nodes)
 |      according to the size of each partition.
 |      
 |      ::
 |      
 |         t =  Tree("(f,((d, ((a,b),c)),e));")
 |      
 |         print t
 |      
 |         #
 |         #      /-f
 |         #     |
 |         #     |          /-d
 |         # ----|         |
 |         #     |     /---|          /-a
 |         #     |    |    |     /---|
 |         #     |    |     \---|     \-b
 |         #      \---|         |
 |         #          |          \-c
 |         #          |
 |         #           \-e
 |      
 |         t.ladderize()
 |         print t
 |      
 |         #      /-f
 |         # ----|
 |         #     |     /-e
 |         #      \---|
 |         #          |     /-d
 |         #           \---|
 |         #               |     /-c
 |         #                \---|
 |         #                    |     /-a
 |         #                     \---|
 |         #                          \-b
 |  
 |  phonehome(self)
 |  
 |  populate(self, size, names_library=None, reuse_names=False, random_branches=False, branch_range=(0, 1), support_range=(0, 1))
 |      Generates a random topology by populating current node.
 |      
 |      :argument None names_library: If provided, names library
 |        (list, set, dict, etc.) will be used to name nodes.
 |      
 |      :argument False reuse_names: If True, node names will not be
 |        necessarily unique, which makes the process a bit more
 |        efficient.
 |      
 |      :argument False random_branches: If True, branch distances and support
 |        values will be randomized.
 |      
 |      :argument (0,1) branch_range: If random_branches is True, this
 |        range of values will be used to generate random distances.
 |      
 |      :argument (0,1) support_range: If random_branches is True,
 |        this range of values will be used to generate random branch
 |        support values.
 |  
 |  prune(self, nodes, preserve_branch_length=False)
 |      Prunes the topology of a node to conserve only the selected list of leaf
 |      internal nodes. The minimum number of nodes that conserve the
 |      topological relationships among the requested nodes will be
 |      retained. Root node is always conserved.
 |      
 |      :var nodes: a list of node names or node objects that should be retained
 |      
 |      :param False preserve_branch_length: If True, branch lengths
 |        of the deleted nodes are transferred (summed up) to its
 |        parent's branch, thus keeping original distances among
 |        nodes.
 |      
 |      **Examples:**
 |      
 |      ::
 |      
 |        t1 = Tree('(((((A,B)C)D,E)F,G)H,(I,J)K)root;', format=1)
 |        t1.prune(['A', 'B'])
 |      
 |      
 |        #                /-A
 |        #          /D /C|
 |        #       /F|      \-B
 |        #      |  |
 |        #    /H|   \-E
 |        #   |  |                        /-A
 |        #-root  \-G                 -root
 |        #   |                           \-B
 |        #   |   /-I
 |        #    \K|
 |        #       \-J
 |      
 |      
 |      
 |        t1 = Tree('(((((A,B)C)D,E)F,G)H,(I,J)K)root;', format=1)
 |        t1.prune(['A', 'B', 'C'])
 |      
 |        #                /-A
 |        #          /D /C|
 |        #       /F|      \-B
 |        #      |  |
 |        #    /H|   \-E
 |        #   |  |                              /-A
 |        #-root  \-G                  -root- C|
 |        #   |                                 \-B
 |        #   |   /-I
 |        #    \K|
 |        #       \-J
 |      
 |      
 |      
 |        t1 = Tree('(((((A,B)C)D,E)F,G)H,(I,J)K)root;', format=1)
 |        t1.prune(['A', 'B', 'I'])
 |      
 |      
 |        #                /-A
 |        #          /D /C|
 |        #       /F|      \-B
 |        #      |  |
 |        #    /H|   \-E                    /-I
 |        #   |  |                      -root
 |        #-root  \-G                      |   /-A
 |        #   |                             \C|
 |        #   |   /-I                          \-B
 |        #    \K|
 |        #       \-J
 |      
 |        t1 = Tree('(((((A,B)C)D,E)F,G)H,(I,J)K)root;', format=1)
 |        t1.prune(['A', 'B', 'F', 'H'])
 |      
 |        #                /-A
 |        #          /D /C|
 |        #       /F|      \-B
 |        #      |  |
 |        #    /H|   \-E
 |        #   |  |                              /-A
 |        #-root  \-G                -root-H /F|
 |        #   |                                 \-B
 |        #   |   /-I
 |        #    \K|
 |        #       \-J
 |  
 |  remove_child(self, child)
 |      Removes a child from this node (parent and child
 |      nodes still exit but are no longer connected).
 |  
 |  remove_sister(self, sister=None)
 |      Removes a sister node. It has the same effect as
 |      **`TreeNode.up.remove_child(sister)`**
 |      
 |      If a sister node is not supplied, the first sister will be deleted
 |      and returned.
 |      
 |      :argument sister: A node instance
 |      
 |      :return: The node removed
 |  
 |  render(self, file_name, layout=None, w=None, h=None, tree_style=None, units='px', dpi=90)
 |      Renders the node structure as an image.
 |      
 |      :var file_name: path to the output image file. valid
 |        extensions are .SVG, .PDF, .PNG
 |      
 |      :var layout: a layout function or a valid layout function name
 |      
 |      :var tree_style: a `TreeStyle` instance containing the image
 |        properties
 |      
 |      :var px units: "px": pixels, "mm": millimeters, "in": inches
 |      :var None h: height of the image in :attr:`units`
 |      :var None w: width of the image in :attr:`units`
 |      :var 300 dpi: dots per inches.
 |  
 |  resolve_polytomy(self, default_dist=0.0, default_support=0.0, recursive=True)
 |      .. versionadded: 2.2
 |      
 |      Resolve all polytomies under current node by creating an
 |      arbitrary dicotomic structure among the affected nodes. This
 |      function randomly modifies current tree topology and should
 |      only be used for compatibility reasons (i.e. programs
 |      rejecting multifurcated node in the newick representation).
 |      
 |      :param 0.0 default_dist: artificial branch distance of new
 |          nodes.
 |      
 |      :param 0.0 default_support: artificial branch support of new
 |          nodes.
 |      
 |      :param True recursive: Resolve any polytomy under this
 |           node. When False, only current node will be checked and fixed.
 |  
 |  robinson_foulds(self, t2, attr_t1='name', attr_t2='name', unrooted_trees=False, expand_polytomies=False, polytomy_size_limit=5, skip_large_polytomies=False, correct_by_polytomy_size=False, min_support_t1=0.0, min_support_t2=0.0)
 |      .. versionadded: 2.2
 |      
 |      Returns the Robinson-Foulds symmetric distance between current
 |      tree and a different tree instance.
 |      
 |      :param t2: reference tree
 |      
 |      :param name attr_t1: Compare trees using a custom node
 |                            attribute as a node name.
 |      
 |      :param name attr_t2: Compare trees using a custom node
 |                            attribute as a node name in target tree.
 |      
 |      :param False attr_t2: If True, consider trees as unrooted.
 |      
 |      :param False expand_polytomies: If True, all polytomies in the reference
 |         and target tree will be expanded into all possible binary
 |         trees. Robinson-foulds distance will be calculated between all
 |         tree combinations and the minimum value will be returned.
 |         See also, :func:`NodeTree.expand_polytomy`.
 |      
 |      :returns: (rf, rf_max, common_attrs, names, edges_t1, edges_t2,  discarded_edges_t1, discarded_edges_t2)
 |  
 |  search_nodes(self, **conditions)
 |      Returns the list of nodes matching a given set of conditions.
 |      
 |      **Example:**
 |      
 |      ::
 |      
 |        tree.search_nodes(dist=0.0, name="human")
 |  
 |  set_outgroup(self, outgroup)
 |      Sets a descendant node as the outgroup of a tree.  This function
 |      can be used to root a tree or even an internal node.
 |      
 |      :argument outgroup: a node instance within the same tree
 |        structure that will be used as a basal node.
 |  
 |  set_style(self, node_style)
 |      .. versionadded: 2.1
 |      
 |      Set 'node_style' as the fixed style for the current node.
 |  
 |  show(self, layout=None, tree_style=None, name='ETE')
 |      Starts an interactive session to visualize current node
 |      structure using provided layout and TreeStyle.
 |  
 |  sort_descendants(self, attr='name')
 |      .. versionadded: 2.1
 |      
 |      Sort the branches of a given tree by node names. After the
 |      tree is sorted, nodes are labeled in ascending order. This
 |      can be used to ensure that nodes in a tree with the same node
 |      names are always labeled in the same way. Note that if
 |      duplicated names are present, extra criteria should be added
 |      to sort nodes.
 |      
 |      Unique id is stored as a node._nid attribute
 |  
 |  standardize(self, delete_orphan=True, preserve_branch_length=True)
 |      .. versionadded:: 2.3
 |      
 |      process current tree structure to produce a standardized topology: nodes
 |      with only one child are removed and multifurcations are automatically resolved.
 |  
 |  swap_children(self)
 |      Swaps current children order.
 |  
 |  traverse(self, strategy='levelorder', is_leaf_fn=None)
 |      Returns an iterator to traverse the tree structure under this
 |      node.
 |      
 |      :argument "levelorder" strategy: set the way in which tree
 |         will be traversed. Possible values are: "preorder" (first
 |         parent and then children) 'postorder' (first children and
 |         the parent) and "levelorder" (nodes are visited in order
 |         from root to leaves)
 |      
 |      :argument None is_leaf_fn: If supplied, ``is_leaf_fn``
 |         function will be used to interrogate nodes about if they
 |         are terminal or internal. ``is_leaf_fn`` function should
 |         receive a node instance as first argument and return True
 |         or False. Use this argument to traverse a tree by
 |         dynamically collapsing internal nodes matching
 |         ``is_leaf_fn``.
 |  
 |  unroot(self)
 |      Unroots current node. This function is expected to be used on
 |      the absolute tree root node, but it can be also be applied to
 |      any other internal node. It will convert a split into a
 |      multifurcation.
 |  
 |  write(self, features=None, outfile=None, format=0, is_leaf_fn=None, format_root_node=False, dist_formatter=None, support_formatter=None, name_formatter=None, quoted_node_names=False)
 |      Returns the newick representation of current node. Several
 |      arguments control the way in which extra data is shown for
 |      every node:
 |      
 |      :argument features: a list of feature names to be exported
 |        using the Extended Newick Format (i.e. features=["name",
 |        "dist"]). Use an empty list to export all available features
 |        in each node (features=[])
 |      
 |      :argument outfile: writes the output to a given file
 |      
 |      :argument format: defines the newick standard used to encode the
 |        tree. See tutorial for details.
 |      
 |      :argument False format_root_node: If True, it allows features
 |        and branch information from root node to be exported as a
 |        part of the newick text string. For newick compatibility
 |        reasons, this is False by default.
 |      
 |      :argument is_leaf_fn: See :func:`TreeNode.traverse` for
 |        documentation.
 |      
 |      **Example:**
 |      
 |      ::
 |      
 |           t.get_newick(features=["species","name"], format=1)
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  from_parent_child_table(parent_child_table)
 |      Converts a parent-child table into an ETE Tree instance. 
 |      
 |      :argument parent_child_table: a list of tuples containing parent-child
 |         relationships. For example: [("A", "B", 0.1), ("A", "C", 0.2), ("C",
 |         "D", 1), ("C", "E", 1.5)]. Where each tuple represents: [parent, child,
 |         child-parent-dist]
 |      
 |      :returns: A new Tree instance 
 |      
 |      :example:
 |      
 |      >>> tree = Tree.from_parent_child_table([("A", "B", 0.1), ("A", "C", 0.2), ("C", "D", 1), ("C", "E", 1.5)])
 |      >>> print tree
 |  
 |  from_skbio(skbio_tree, map_attributes=None)
 |      Converts a scikit-bio TreeNode object into ETE Tree object. 
 |      
 |      :argument skbio_tree: a scikit bio TreeNode instance 
 |      
 |      :argument None map_attributes: A list of attribute nanes in the
 |         scikit-bio tree that should be mapped into the ETE tree
 |         instance. (name, id and branch length are always mapped)
 |      
 |      :returns: A new Tree instance 
 |      
 |      :example:
 |      
 |      >>> tree = Tree.from_skibio(skbioTree, map_attributes=["value"])
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)
 |  
 |  children
 |  
 |  dist
 |  
 |  faces
 |  
 |  img_style
 |  
 |  support
 |  
 |  up

In [18]:
ts = ete3.TreeStyle()
ts.show_leaf_name = True
ts.scale = 250
ts.branch_vertical_margin = 15

We can apply this TreeStyle to a random tree as follows. Any changes that you make to the TreeStyle above will impact this tree and all following trees in this chapter. Experiment with this - it's fun!

In [19]:
# Make a tree object
t = ete3.Tree()

# Generates a random tree of 10
t.populate(10)

# Plot the tree
t.render("%%inline", tree_style=ts)
Out[19]:

2.4.6 Distance-based approaches to phylogenetic reconstruction [edit]

The next approaches we'll take for phylogenetic reconstruction rely on computing distances between sequences. We've previously discussed distances between sequences in a few places in the text. We'll begin this section by formalizing the term distance, and introducing the concept of a distance matrix.

2.4.6.1 Distances and distance matrices [edit]

Generally, we can consider a distance between a pair of objects is a measure of their dissimilarity.

  • There isn't one single definition of the distance between two objects.
  • For example, if your two objects are the cities Flagstaff, Arizona and Boulder, Colorado you might measure the distance between them as length of shortest line that connects them. This would be the Euclidean distance between the two cities.
  • If you're trying to travel from one city to another however, that might not be the most relevant distance measure. Instead, you might be interested in something like their Manhatten distance, which in this would be more similar to a measure of the shortest route that you could travel between the two cities using the Interstate Highway system. You can see how this is different than the shortest line connecting the two cities.

image.png

An illustration comparing the taxicab metric to the Euclidean metric on the plane:

  • According to the taxicab metric all three pictured paths (red, yellow, and blue) have the same length (12).
  • According to the Euclidean metric, the green path has length and the unique shortest path is $$6\sqrt{2} = 8.49$$

Similarly, different distance metrics will be relevant and not relevant for different types of objects. Clearly the way you measure distances between cities is very different than the way we measure distances between biological sequences. However, the underlying concept of a distance between two objects is the same.

https://en.wikipedia.org/wiki/Metric_(mathematics)

https://en.wikipedia.org/wiki/Metric_space

  • In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set.
  • A set with a metric is called a metric space.

https://en.wikipedia.org/wiki/Similarity_measure

  • In statistics and related fields, a similarity measure or similarity function is a real-valued function that quantifies the similarity between two objects.
  • Although no single definition of a similarity measure exists, usually such measures are in some sense the inverse of distance metrics:
    • they take on large values for similar objects and either zero or a negative value for very dissimilar objects.

Formally, a measure of dissimilarity $d$ between two objects $x$ and $y$ is a distance if it meets these four criteria for all $x$ and $y$:

  1. $d(x,y) \geq 0$ (non-negativity)
  2. $d(x,y) = 0\ if\ and\ only\ if\ x = y$ (identity of indiscernibles)
  3. $d(x,y) = d(y,x)$ (symmetry)
  4. $d(x,z) \leq d(x,y) + d(y,z)$ (triangle inequality)

When we compute the distances between some number of objects $n$, we'll commonly represent those values in a distance matrix which contains all of those values.

image.png

Note the tree on the axes

These distance matrices are so common in bioinformatics that scikit-bio defines a DistanceMatrix object that provides a convenient interface for working with these data.

In [20]:
from skbio import DistanceMatrix
from skbio.stats.distance._base import DissimilarityMatrix
#help(DistanceMatrix)
help(DissimilarityMatrix)
Help on class DissimilarityMatrix in module skbio.stats.distance._base:

class DissimilarityMatrix(skbio._base.SkbioObject)
 |  Store dissimilarities between objects.
 |  
 |  A `DissimilarityMatrix` instance stores a square, hollow, two-dimensional
 |  matrix of dissimilarities between objects. Objects could be, for example,
 |  samples or DNA sequences. A sequence of IDs accompanies the
 |  dissimilarities.
 |  
 |  Methods are provided to load and save dissimilarity matrices from/to disk,
 |  as well as perform common operations such as extracting dissimilarities
 |  based on object ID.
 |  
 |  Parameters
 |  ----------
 |  data : array_like or DissimilarityMatrix
 |      Square, hollow, two-dimensional ``numpy.ndarray`` of dissimilarities
 |      (floats), or a structure that can be converted to a ``numpy.ndarray``
 |      using ``numpy.asarray`` or a one-dimensional vector of dissimilarities
 |      (floats), as defined by `scipy.spatial.distance.squareform`. Can
 |      instead be a `DissimilarityMatrix` (or subclass) instance,
 |      in which case the instance's data will be used.
 |      Data will be converted to a float ``dtype`` if necessary. A copy will
 |      *not* be made if already a ``numpy.ndarray`` with a float ``dtype``.
 |  ids : sequence of str, optional
 |      Sequence of strings to be used as object IDs. Must match the number of
 |      rows/cols in `data`. If ``None`` (the default), IDs will be
 |      monotonically-increasing integers cast as strings, with numbering
 |      starting from zero, e.g., ``('0', '1', '2', '3', ...)``.
 |  
 |  See Also
 |  --------
 |  DistanceMatrix
 |  scipy.spatial.distance.squareform
 |  
 |  Notes
 |  -----
 |  The dissimilarities are stored in redundant (square-form) format [1]_.
 |  
 |  The data are not checked for symmetry, nor guaranteed/assumed to be
 |  symmetric.
 |  
 |  References
 |  ----------
 |  .. [1] http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
 |  
 |  Method resolution order:
 |      DissimilarityMatrix
 |      skbio._base.SkbioObject
 |      builtins.object
 |  
 |  Methods defined here:
 |  
 |  __contains__(self, lookup_id)
 |      Check if the specified ID is in the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      lookup_id : str
 |          ID to search for.
 |      
 |      Returns
 |      -------
 |      bool
 |          ``True`` if `lookup_id` is in the dissimilarity matrix, ``False``
 |          otherwise.
 |      
 |      See Also
 |      --------
 |      index
 |  
 |  __eq__(self, other)
 |      Compare this dissimilarity matrix to another for equality.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Two dissimilarity matrices are equal if they have the same shape, IDs
 |      (in the same order!), and have data arrays that are equal.
 |      
 |      Checks are *not* performed to ensure that `other` is a
 |      `DissimilarityMatrix` instance.
 |      
 |      Parameters
 |      ----------
 |      other : DissimilarityMatrix
 |          Dissimilarity matrix to compare to for equality.
 |      
 |      Returns
 |      -------
 |      bool
 |          ``True`` if `self` is equal to `other`, ``False`` otherwise.
 |  
 |  __getitem__(self, index)
 |      Slice into dissimilarity data by object ID or numpy indexing.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Extracts data from the dissimilarity matrix by object ID, a pair of
 |      IDs, or numpy indexing/slicing.
 |      
 |      Parameters
 |      ----------
 |      index : str, two-tuple of str, or numpy index
 |          `index` can be one of the following forms: an ID, a pair of IDs, or
 |          a numpy index.
 |      
 |          If `index` is a string, it is assumed to be an ID and a
 |          ``numpy.ndarray`` row vector is returned for the corresponding ID.
 |          Note that the ID's row of dissimilarities is returned, *not* its
 |          column. If the matrix is symmetric, the two will be identical, but
 |          this makes a difference if the matrix is asymmetric.
 |      
 |          If `index` is a two-tuple of strings, each string is assumed to be
 |          an ID and the corresponding matrix element is returned that
 |          represents the dissimilarity between the two IDs. Note that the
 |          order of lookup by ID pair matters if the matrix is asymmetric: the
 |          first ID will be used to look up the row, and the second ID will be
 |          used to look up the column. Thus, ``dm['a', 'b']`` may not be the
 |          same as ``dm['b', 'a']`` if the matrix is asymmetric.
 |      
 |          Otherwise, `index` will be passed through to
 |          ``DissimilarityMatrix.data.__getitem__``, allowing for standard
 |          indexing of a ``numpy.ndarray`` (e.g., slicing).
 |      
 |      Returns
 |      -------
 |      ndarray or scalar
 |          Indexed data, where return type depends on the form of `index` (see
 |          description of `index` for more details).
 |      
 |      Raises
 |      ------
 |      MissingIDError
 |          If the ID(s) specified in `index` are not in the dissimilarity
 |          matrix.
 |      
 |      Notes
 |      -----
 |      The lookup based on ID(s) is quick.
 |  
 |  __init__(self, data, ids=None)
 |      State: Experimental as of 0.4.0.
 |  
 |  __ne__(self, other)
 |      Determine whether two dissimilarity matrices are not equal.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      other : DissimilarityMatrix
 |          Dissimilarity matrix to compare to.
 |      
 |      Returns
 |      -------
 |      bool
 |          ``True`` if `self` is not equal to `other`, ``False`` otherwise.
 |      
 |      See Also
 |      --------
 |      __eq__
 |  
 |  __str__(self)
 |      Return a string representation of the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Summary includes matrix dimensions, a (truncated) list of IDs, and
 |      (truncated) array of dissimilarities.
 |      
 |      Returns
 |      -------
 |      str
 |          String representation of the dissimilarity matrix.
 |  
 |  copy(self)
 |      Return a deep copy of the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Deep copy of the dissimilarity matrix. Will be the same type as
 |          `self`.
 |  
 |  filter(self, ids, strict=True)
 |      Filter the dissimilarity matrix by IDs.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      ids : iterable of str
 |          IDs to retain. May not contain duplicates or be empty. Each ID must
 |          be present in the dissimilarity matrix.
 |      strict : bool, optional
 |          If `strict` is ``True`` and an ID that is not found in the distance
 |          matrix is found in `ids`, a ``MissingIDError`` exception will be
 |          raised, otherwise the ID will be ignored.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Filtered dissimilarity matrix containing only the IDs specified in
 |          `ids`. IDs will be in the same order as they appear in `ids`.
 |      
 |      Raises
 |      ------
 |      MissingIDError
 |          If an ID in `ids` is not in the object's list of IDs.
 |  
 |  index(self, lookup_id)
 |      Return the index of the specified ID.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      lookup_id : str
 |          ID whose index will be returned.
 |      
 |      Returns
 |      -------
 |      int
 |          Row/column index of `lookup_id`.
 |      
 |      Raises
 |      ------
 |      MissingIDError
 |          If `lookup_id` is not in the dissimilarity matrix.
 |  
 |  plot(self, cmap=None, title='')
 |      Creates a heatmap of the dissimilarity matrix
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Parameters
 |      ----------
 |      cmap: str or matplotlib.colors.Colormap, optional
 |          Sets the color scheme of the heatmap
 |          If ``None``, defaults to the colormap specified in the matplotlib
 |          rc file.
 |      
 |      title: str, optional
 |          Sets the title label of the heatmap
 |          (Default is blank)
 |      
 |      Returns
 |      -------
 |      matplotlib.figure.Figure
 |          Figure containing the heatmap and colorbar of the plotted
 |          dissimilarity matrix.
 |      
 |      Examples
 |      --------
 |      .. plot::
 |      
 |         Define a dissimilarity matrix with five objects labeled A-E:
 |      
 |         >>> from skbio.stats.distance import DissimilarityMatrix
 |         >>> dm = DissimilarityMatrix([[0, 1, 2, 3, 4], [1, 0, 1, 2, 3],
 |         ...                           [2, 1, 0, 1, 2], [3, 2, 1, 0, 1],
 |         ...                           [4, 3, 2, 1, 0]],
 |         ...                          ['A', 'B', 'C', 'D', 'E'])
 |      
 |         Plot the dissimilarity matrix as a heatmap:
 |      
 |         >>> fig = dm.plot(cmap='Reds', title='Example heatmap')
 |  
 |  redundant_form(self)
 |      Return an array of dissimilarities in redundant format.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      As this is the native format that the dissimilarities are stored in,
 |      this is simply an alias for `data`.
 |      
 |      Returns
 |      -------
 |      ndarray
 |          Two-dimensional ``numpy.ndarray`` of dissimilarities in redundant
 |          format.
 |      
 |      Notes
 |      -----
 |      Redundant format is described in [1]_.
 |      
 |      Does *not* return a copy of the data.
 |      
 |      References
 |      ----------
 |      .. [1] http://docs.scipy.org/doc/scipy/reference/spatial.distance.html
 |  
 |  to_data_frame(self)
 |      Create a ``pandas.DataFrame`` from this ``DissimilarityMatrix``.
 |      
 |      State: Experimental as of 0.4.1.
 |      
 |      Returns
 |      -------
 |      pd.DataFrame
 |          ``pd.DataFrame`` with IDs on index and columns.
 |      
 |      Examples
 |      --------
 |      >>> from skbio import DistanceMatrix
 |      >>> dm = DistanceMatrix([[0, 1, 2],
 |      ...                      [1, 0, 3],
 |      ...                      [2, 3, 0]], ids=['a', 'b', 'c'])
 |      >>> df = dm.to_data_frame()
 |      >>> df
 |         a  b  c
 |      a  0  1  2
 |      b  1  0  3
 |      c  2  3  0
 |  
 |  transpose(self)
 |      Return the transpose of the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      A deep copy is returned.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          Transpose of the dissimilarity matrix. Will be the same type as
 |          `self`.
 |  
 |  write(self, file, format='lsmat', **kwargs)
 |      Write an instance of ``DissimilarityMatrix`` to a file.
 |      
 |      This is a convenience method for :func:`skbio.io.registry.write`.
 |      For more information about the I/O system in scikit-bio, please
 |      see :mod:`skbio.io`.
 |      
 |      Supported file formats include:
 |      
 |      - ``'lsmat'`` (:mod:`skbio.io.format.lsmat`)
 |      
 |      Parameters
 |      ----------
 |      file : openable (filepath, URL, filehandle, etc.)
 |          The location to write the given `format` into.  Something
 |          that is understood by :func:`skbio.io.util.open`. Filehandles
 |          are not automatically closed, it is the responsibility of the
 |          caller.
 |      format : str
 |          The format must be a registered format name with a writer for
 |          ``DissimilarityMatrix``.
 |          Default is `'lsmat'`.
 |      kwargs : dict, optional
 |          Keyword arguments passed to :func:`skbio.io.registry.write`
 |          and the file format writer.
 |      
 |      See Also
 |      --------
 |      read
 |      skbio.io.registry.write
 |      skbio.io.util.open
 |      skbio.io.format.lsmat
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  read(cls, file, format=None, **kwargs)
 |      Create a new ``DissimilarityMatrix`` instance from a file.
 |      
 |      This is a convenience method for :func:`skbio.io.registry.read`. For
 |      more information about the I/O system in scikit-bio, please see
 |      :mod:`skbio.io`.
 |      
 |      Supported file formats include:
 |      
 |      - ``'lsmat'`` (:mod:`skbio.io.format.lsmat`)
 |      
 |      Parameters
 |      ----------
 |      file : openable (filepath, URL, filehandle, etc.)
 |          The location to read the given `format`. Something that is
 |          understood by :func:`skbio.io.util.open`. Filehandles are not
 |          automatically closed, it is the responsibility of the caller.
 |      format : str, optional
 |          The format must be a format name with a reader for ``DissimilarityMatrix``.
 |          If a `format` is not provided or is None, it will attempt to
 |          guess the format.
 |      kwargs : dict, optional
 |          Keyword arguments passed to :func:`skbio.io.registry.read` and
 |          the file format reader for ``DissimilarityMatrix``.
 |      
 |      Returns
 |      -------
 |      DissimilarityMatrix
 |          A new instance.
 |      
 |      See Also
 |      --------
 |      write
 |      skbio.io.registry.read
 |      skbio.io.util.open
 |      skbio.io.format.lsmat
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors defined here:
 |  
 |  T
 |      Transpose of the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      See Also
 |      --------
 |      transpose
 |  
 |  data
 |      Array of dissimilarities.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      A square, hollow, two-dimensional ``numpy.ndarray`` of dissimilarities
 |      (floats). A copy is *not* returned.
 |      
 |      Notes
 |      -----
 |      This property is not writeable.
 |  
 |  dtype
 |      Data type of the dissimilarities.
 |      
 |      State: Experimental as of 0.4.0.
 |  
 |  ids
 |      Tuple of object IDs.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      A tuple of strings, one for each object in the dissimilarity matrix.
 |      
 |      Notes
 |      -----
 |      This property is writeable, but the number of new IDs must match the
 |      number of objects in `data`.
 |  
 |  png
 |      Display heatmap in IPython Notebook as PNG.
 |      
 |      State: Experimental as of 0.4.0.
 |  
 |  shape
 |      Two-element tuple containing the dissimilarity matrix dimensions.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      As the dissimilarity matrix is guaranteed to be square, both tuple
 |      entries will always be equal.
 |  
 |  size
 |      Total number of elements in the dissimilarity matrix.
 |      
 |      State: Experimental as of 0.4.0.
 |      
 |      Notes
 |      -----
 |      Equivalent to ``self.shape[0] * self.shape[1]``.
 |  
 |  svg
 |      Display heatmap in IPython Notebook as SVG.
 |      
 |      State: Experimental as of 0.4.0.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __abstractmethods__ = frozenset()
 |  
 |  __hash__ = None
 |  
 |  default_write_format = 'lsmat'
 |  
 |  ----------------------------------------------------------------------
 |  Data descriptors inherited from skbio._base.SkbioObject:
 |  
 |  __dict__
 |      dictionary for instance variables (if defined)
 |  
 |  __weakref__
 |      list of weak references to the object (if defined)

We can create one of these objects as follows:

In [21]:
dm = DistanceMatrix([[0.0, 1.0, 2.0],
                     [1.0, 0.0, 3.0],
                     [2.0, 3.0, 0.0]],
                    ids=['a', 'b', 'c'])

We can then access the values in the distance matrix directly, view the distance matrix as a heatmap, and do many other things that facilitate analyzing distances between objects.

In [22]:
print(dm['a', 'b'])
print(dm['b', 'c'])
1.0
3.0
In [23]:
help(dm.plot)
Help on method plot in module skbio.stats.distance._base:

plot(cmap=None, title='') method of skbio.stats.distance._base.DistanceMatrix instance
    Creates a heatmap of the dissimilarity matrix
    
    State: Experimental as of 0.4.0.
    
    Parameters
    ----------
    cmap: str or matplotlib.colors.Colormap, optional
        Sets the color scheme of the heatmap
        If ``None``, defaults to the colormap specified in the matplotlib
        rc file.
    
    title: str, optional
        Sets the title label of the heatmap
        (Default is blank)
    
    Returns
    -------
    matplotlib.figure.Figure
        Figure containing the heatmap and colorbar of the plotted
        dissimilarity matrix.
    
    Examples
    --------
    .. plot::
    
       Define a dissimilarity matrix with five objects labeled A-E:
    
       >>> from skbio.stats.distance import DissimilarityMatrix
       >>> dm = DissimilarityMatrix([[0, 1, 2, 3, 4], [1, 0, 1, 2, 3],
       ...                           [2, 1, 0, 1, 2], [3, 2, 1, 0, 1],
       ...                           [4, 3, 2, 1, 0]],
       ...                          ['A', 'B', 'C', 'D', 'E'])
    
       Plot the dissimilarity matrix as a heatmap:
    
       >>> fig = dm.plot(cmap='Reds', title='Example heatmap')

In [24]:
_ = dm.plot(cmap='Greens')

The conditions of a distance metric listed above lead to a few specific features of distance matrices:

  • they're symmetric (if you flip the upper triangle over the diagonal, the values are the same as those in the lower triangle),
  • hollow (the diagonal is all zeros), and
  • all values are greater than or equal to zero.

Which of the conditions listed above results in each of these features of distance matrices?

2.4.6.2 Alignment-free distances between sequences [edit]

We've now looked at several ways of computing distances between sequences, some of which have required that the positions in the sequences are directly comparable to one another (i.e., that our sequences are aligned), and some of which haven't.

One alignment-free distance between sequences is the k-mer distance that we worked with in Sequence Homology Searching.

We can use the kmer_distance function with scikit-bio as follows to create an skbio.DistanceMatrix object.

In [25]:
from iab.algorithms import kmer_distance
%psource kmer_distance
In [26]:
help(DistanceMatrix.from_iterable)
Help on function from_iterable in module skbio.stats.distance._base:

from_iterable(cls, iterable, metric, key=None, keys=None)
    Create DistanceMatrix from all pairs in an iterable given a metric.
    
    State: Experimental as of 0.4.1.
    
    Parameters
    ----------
    iterable : iterable
        Iterable containing objects to compute pairwise distances on.
    metric : callable
        A function that takes two arguments and returns a float
        representing the distance between the two arguments.
    key : callable or metadata key, optional
        A function that takes one argument and returns a string
        representing the id of the element in the distance matrix.
        Alternatively, a key to a `metadata` property if it exists for
        each element in the `iterable`. If None, then default ids will be
        used.
    keys : iterable, optional
        An iterable of the same length as `iterable`. Each element will be
        used as the respective key.
    
    Returns
    -------
    DistanceMatrix
        The `metric` applied to all pairwise elements in the `iterable`.
    
    Raises
    ------
    ValueError
        If `key` and `keys` are both provided.
    
    Notes
    -----
    Symmetry and hollowness are assumed when calculating the distances via
    `metric`. Therefore, distances are only computed for the strictly
    upper/lower triangle.

In [27]:
print(sequences)
kmer_dm = DistanceMatrix.from_iterable(sequences, 
                                       metric=kmer_distance, 
                                       key='id')
print(kmer_dm)
_ = kmer_dm.plot(cmap='Greens', 
                 title='3mer distances between sequences')
[DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.1.1.1.2.1.2.2.1.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 58.00%
--------------------------------------------------------
0 CCTCCTCAAG GATGCGCTGT CTCGGAGGTA GGTCTCTACG TCCTCTGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.2.2.1.1.1.2.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 38.00%
--------------------------------------------------------
0 ACTCTTTAAG TATGGTTCCT CTCCGAGATA GGTCTTTAGG TACTTATTTC, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.1.2.1.1.2.2.2.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 48.00%
--------------------------------------------------------
0 ACTCTTTAAG CATGGTCTGT CTCCGATGCA GGTGTCTGCA TTCTCAGTTG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.2.2.2.2.1.2.1.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 52.00%
--------------------------------------------------------
0 ACACTTTAAG TATGGGCTGT CTGCGATGTC CCTCTGCGCG TCCTTAGAGT, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.1.1.2.2.1.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 38.00%
--------------------------------------------------------
0 AGTCTTTAAG TATGTTCAGT ATGCGATCTG GGTCTCTGAG TTCATAGATT, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.2.1.1.1.1.2.2.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 42.00%
--------------------------------------------------------
0 ATAATTTAAG AAAGGACGGT GTCAGATGTA GGGCGCTATG CTCTTACTTG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.1.1.2.2.1.2.2.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 54.00%
--------------------------------------------------------
0 CCCCTACAAG TATGGTCTGT AGCCGAGGTG TGTCTCTACG TCCTCAGATC, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.2.1.2.1.1.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 44.00%
--------------------------------------------------------
0 AATCGTAAAG TACGGGTCGT CTGCGATGTA GGTCTTTGCT TTCTTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.1.1.2.1.2.2.1.1.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 46.00%
--------------------------------------------------------
0 ACCCTTCAAG TATTGGCTGT CTCCGAGGTT GTTCTCTGAG TACTTAGATC, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.1.2.2.2.1.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 42.00%
--------------------------------------------------------
0 ACTCCTTAAG TATGGGCTGT CTGAGATCTA GGTTTCTGCT TTCCTAGATT, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.2.2.2.2.1.2.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 42.00%
--------------------------------------------------------
0 TCTCTTTTCG TATGGTCTGT CATCGTTGTA AGTCCATTCG TTCGTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.2.1.1.2.2.1.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 50.00%
--------------------------------------------------------
0 ACTCTTTACG CGTTGTCCGT CTCCGATCTA GGAACCTTCG TTCTCAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.1.1.1.1.2.2.2.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 50.00%
--------------------------------------------------------
0 ACTCTTGAAG CCTGGACTGT CTCCGATGTA GGTCTCTGCA TGCTTAGATC, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.2.2.1.2.2.1.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 46.00%
--------------------------------------------------------
0 ACTCTTTAAG TATCGGCTGT CTAGGATGTC GGTCGCTGCG TGCTTATATT, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.2.2.2.2.1.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 42.00%
--------------------------------------------------------
0 TCTCTTTTCG TATGGTCTGT CATCGTTGTA AGACCATACG TTGCTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.1.2.1.2.2.2.2.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 46.00%
--------------------------------------------------------
0 ACTCTTAAAA TGTGGGCCGT CTGTGATCCA GGACTCTGCG TTCTTAGATT, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.2.2.2.2.2.2.2.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 50.00%
--------------------------------------------------------
0 ACTCCTTAAG AATGGTCTGT CACCGCTGTC GGTCCATTCG TTCCCAGATA, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.2.2.1.2.2.2.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 50.00%
--------------------------------------------------------
0 AGTCTTTAAG TAGGGTCTGT CACCGATGTA GGTACCTACG TTCTCAGGCG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.2.1.1.1.2.2.2.2.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 44.00%
--------------------------------------------------------
0 ACGCTTTCAT TATGGACCAA GTCCAGTCCA GGGCTCTACG TTCTTAAAAA, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.2.1.1.2.1.1.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 40.00%
--------------------------------------------------------
0 ACTCTTTTTG TATGGATTGA CAGCGATGTA AAGCTCTGCG TTCTTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.2.1.1.2.1.2.1.2.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 48.00%
--------------------------------------------------------
0 TCAGCTTACG TATGGTCTGT TTCCGATGTA GCGCTCTACG CTCTTCGATA, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.1.2.2.1.1.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 44.00%
--------------------------------------------------------
0 ACTCTCTAAG TATGGGCTGA GTTACAACAA GTTCTCTGCG TTCCTAGATC, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.1.1.2.2.2.1.1.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 48.00%
--------------------------------------------------------
0 ATTCTTTGAG AATGGTCCGT CTCCGATGTA GGACGCTACG TCCTTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.1.1.1.1.2.2.2.2.1.1'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 46.00%
--------------------------------------------------------
0 ACTCTTTGAG CATGGTCTGT CTCCGATGTA GGACTCTACG TTCTTAGATG, DNA
--------------------------------------------------------
Metadata:
    'id': '0.2.2.2.2.2.2.2.2.1.2'
Stats:
    length: 50
    has gaps: False
    has degenerates: False
    has non-degenerates: True
    GC-content: 48.00%
--------------------------------------------------------
0 CCACACTATG TATGGACTGT CGGTGAAGTC GCTCTCTGTG TTCTTAGATG]
25x25 distance matrix
IDs:
'0.2.1.1.1.2.1.2.2.1.2', '0.1.2.2.1.1.1.2.2.1.1', '0.1.1.1.2.1.1.2.2.2.2', ...
Data:
[[ 0.          0.68181818  0.72340426  0.59090909  0.65217391  0.55319149
   0.5         0.54761905  0.58695652  0.56818182  0.65116279  0.56818182
   0.59090909  0.48780488  0.56818182  0.69387755  0.64583333  0.55813953
   0.65957447  0.68085106  0.51219512  0.55813953  0.5         0.56818182
   0.59090909]
 [ 0.68181818  0.          0.46153846  0.48780488  0.41025641  0.58333333
   0.53488372  0.58139535  0.375       0.34210526  0.47368421  0.56818182
   0.55813953  0.55813953  0.56818182  0.60869565  0.55555556  0.48780488
   0.6         0.55813953  0.51219512  0.52380952  0.46341463  0.46341463
   0.59090909]
 [ 0.72340426  0.46153846  0.          0.60869565  0.43902439  0.6
   0.58695652  0.6         0.51111111  0.48837209  0.46153846  0.58695652
   0.47619048  0.57777778  0.58695652  0.53333333  0.51111111  0.51162791
   0.58695652  0.51162791  0.53488372  0.69387755  0.61702128  0.45238095
   0.63829787]
 [ 0.59090909  0.48780488  0.60869565  0.          0.46511628  0.56
   0.54347826  0.52272727  0.5         0.36585366  0.55813953  0.63265306
   0.56521739  0.46511628  0.60416667  0.58333333  0.5625      0.5
   0.63265306  0.46511628  0.55555556  0.3902439   0.57446809  0.60416667
   0.53333333]
 [ 0.65217391  0.41025641  0.43902439  0.46511628  0.          0.53061224
   0.57446809  0.48837209  0.46666667  0.325       0.36842105  0.63265306
   0.56521739  0.5         0.44186047  0.52173913  0.5625      0.46511628
   0.60416667  0.42857143  0.48837209  0.5         0.60416667  0.54347826
   0.56521739]
 [ 0.55319149  0.58333333  0.6         0.56        0.53061224  0.
   0.64814815  0.45652174  0.61111111  0.51020408  0.64        0.54
   0.58823529  0.36363636  0.51020408  0.52        0.58490566  0.53061224
   0.51020408  0.5         0.55102041  0.64150943  0.47916667  0.59615385
   0.53061224]
 [ 0.5         0.53488372  0.58695652  0.54347826  0.57446809  0.64814815
   0.          0.625       0.44444444  0.55319149  0.56818182  0.58333333
   0.54347826  0.63265306  0.58333333  0.64705882  0.65384615  0.51111111
   0.66666667  0.66        0.5         0.54347826  0.55319149  0.52173913
   0.63265306]
 [ 0.54761905  0.58139535  0.6         0.52272727  0.48837209  0.45652174
   0.625       0.          0.58333333  0.5         0.51219512  0.56521739
   0.58695652  0.45238095  0.46511628  0.57446809  0.64        0.45238095
   0.65306122  0.52272727  0.54545455  0.58695652  0.53333333  0.56521739
   0.61702128]
 [ 0.58695652  0.375       0.51111111  0.5         0.46666667  0.61111111
   0.44444444  0.58333333  0.          0.33333333  0.48837209  0.54166667
   0.46666667  0.59183673  0.54166667  0.58        0.53061224  0.5         0.6
   0.5625      0.55319149  0.46666667  0.47826087  0.47826087  0.5625    ]
 [ 0.56818182  0.34210526  0.48837209  0.36585366  0.325       0.51020408
   0.55319149  0.5         0.33333333  0.          0.425       0.6122449
   0.47727273  0.4047619   0.48888889  0.46666667  0.57142857  0.47727273
   0.55319149  0.47727273  0.56521739  0.325       0.48888889  0.48888889
   0.47727273]
 [ 0.65116279  0.47368421  0.46153846  0.55813953  0.36842105  0.64
   0.56818182  0.51219512  0.48837209  0.425       0.          0.6
   0.59090909  0.52380952  0.2972973   0.57777778  0.48837209  0.55813953
   0.56818182  0.52380952  0.51219512  0.59090909  0.6         0.53488372
   0.52380952]
 [ 0.56818182  0.56818182  0.58695652  0.63265306  0.63265306  0.54
   0.58333333  0.56521739  0.54166667  0.6122449   0.6         0.
   0.57446809  0.60416667  0.55319149  0.5625      0.54166667  0.44186047
   0.58333333  0.60416667  0.42857143  0.63265306  0.41860465  0.45454545
   0.63265306]
 [ 0.59090909  0.55813953  0.47619048  0.56521739  0.56521739  0.58823529
   0.54347826  0.58695652  0.46666667  0.47727273  0.59090909  0.57446809
   0.          0.59574468  0.57446809  0.52173913  0.59183673  0.65306122
   0.73584906  0.5         0.61702128  0.625       0.44186047  0.36585366
   0.53333333]
 [ 0.48780488  0.55813953  0.57777778  0.46511628  0.5         0.36363636
   0.63265306  0.45238095  0.59183673  0.4047619   0.52380952  0.60416667
   0.59574468  0.          0.51111111  0.55319149  0.59183673  0.53333333
   0.60416667  0.53333333  0.52272727  0.56521739  0.60416667  0.63265306
   0.46511628]
 [ 0.56818182  0.56818182  0.58695652  0.60416667  0.44186047  0.51020408
   0.58333333  0.46511628  0.54166667  0.48888889  0.2972973   0.55319149
   0.57446809  0.51111111  0.          0.62        0.5106383   0.51111111
   0.52173913  0.51111111  0.42857143  0.57446809  0.55319149  0.48888889
   0.51111111]
 [ 0.69387755  0.60869565  0.53333333  0.58333333  0.52173913  0.52
   0.64705882  0.57446809  0.58        0.46666667  0.57777778  0.5625
   0.52173913  0.55319149  0.62        0.          0.55102041  0.58333333
   0.5625      0.45454545  0.63265306  0.55319149  0.5625      0.53191489
   0.58333333]
 [ 0.64583333  0.55555556  0.51111111  0.5625      0.5625      0.58490566
   0.65384615  0.64        0.53061224  0.57142857  0.48837209  0.54166667
   0.59183673  0.59183673  0.5106383   0.55102041  0.          0.5625
   0.57142857  0.62        0.52173913  0.67307692  0.57142857  0.62745098
   0.5       ]
 [ 0.55813953  0.48780488  0.51162791  0.5         0.46511628  0.53061224
   0.51111111  0.45238095  0.5         0.47727273  0.55813953  0.44186047
   0.65306122  0.53333333  0.51111111  0.58333333  0.5625      0.
   0.51111111  0.59574468  0.41463415  0.56521739  0.54347826  0.51111111
   0.625     ]
 [ 0.65957447  0.6         0.58695652  0.63265306  0.60416667  0.51020408
   0.66666667  0.65306122  0.6         0.55319149  0.56818182  0.58333333
   0.73584906  0.60416667  0.52173913  0.5625      0.57142857  0.51111111
   0.          0.60416667  0.56521739  0.60416667  0.58333333  0.6122449
   0.63265306]
 [ 0.68085106  0.55813953  0.51162791  0.46511628  0.42857143  0.5         0.66
   0.52272727  0.5625      0.47727273  0.52380952  0.60416667  0.5
   0.53333333  0.51111111  0.45454545  0.62        0.59574468  0.60416667
   0.          0.52272727  0.5         0.54347826  0.47727273  0.5       ]
 [ 0.51219512  0.51219512  0.53488372  0.55555556  0.48837209  0.55102041
   0.5         0.54545455  0.55319149  0.56521739  0.51219512  0.42857143
   0.61702128  0.52272727  0.42857143  0.63265306  0.52173913  0.41463415
   0.56521739  0.52272727  0.          0.61702128  0.46511628  0.42857143
   0.55555556]
 [ 0.55813953  0.52380952  0.69387755  0.3902439   0.5         0.64150943
   0.54347826  0.58695652  0.46666667  0.325       0.59090909  0.63265306
   0.625       0.56521739  0.57446809  0.55319149  0.67307692  0.56521739
   0.60416667  0.5         0.61702128  0.          0.63265306  0.60416667
   0.56521739]
 [ 0.5         0.46341463  0.61702128  0.57446809  0.60416667  0.47916667
   0.55319149  0.53333333  0.47826087  0.48888889  0.6         0.41860465
   0.44186047  0.60416667  0.55319149  0.5625      0.57142857  0.54347826
   0.58333333  0.54347826  0.46511628  0.63265306  0.          0.3
   0.54347826]
 [ 0.56818182  0.46341463  0.45238095  0.60416667  0.54347826  0.59615385
   0.52173913  0.56521739  0.47826087  0.48888889  0.53488372  0.45454545
   0.36585366  0.63265306  0.48888889  0.53191489  0.62745098  0.51111111
   0.6122449   0.47727273  0.42857143  0.60416667  0.3         0.
   0.54347826]
 [ 0.59090909  0.59090909  0.63829787  0.53333333  0.56521739  0.53061224
   0.63265306  0.61702128  0.5625      0.47727273  0.52380952  0.63265306
   0.53333333  0.46511628  0.51111111  0.58333333  0.5         0.625
   0.63265306  0.5         0.55555556  0.56521739  0.54347826  0.54347826
   0.        ]]

2.4.6.3 Alignment-based distances between sequences [edit]

One alignment-based distance metric that we've looked at is Hamming distance. This would be considered an alignment-based approach because is does consider the order of the characters in the sequence by comparing a character at a position in one sequence only to the character at the corresponding position in the other sequence. We could compute these distances as follows, after first aligning our sequences.

Note: here, we specified a 0 probability of indel mutations, with sequences of identical length, and thus they are already aligned:

In [28]:
from skbio.sequence.distance import hamming

hamming_dm = DistanceMatrix.from_iterable(sequences_aligned, 
                                          metric=hamming, 
                                          key='id')

_ = hamming_dm.plot(cmap='Greens', 
                    title='Hamming distances between sequences')

2.4.6.4 Jukes-Cantor correction of observed distances between sequences [edit]

Why question: How long should branches be? Does it matter for our conclusions?

UPGMA is "ultrametric", meaning that all the terminal nodes (i.e. the sequences/taxa) are equally distance from the root. In molecular terms, this means that UPGMA assumes a molecular clock, i.e. all lineages are evolving at a constant rate. In practical terms, this means that you can construct a distance scale bar and all the terminal nodes will be level at position 0.0, representing the present. In this example, the scale bar is shown on the right-hand side.

image.png

The Hamming distance between aligned sequences, as described above, is simple to calculate, but it is often an underestimate of the actual amount of mutation that has occurred in a sequence. Here's why:

  • imagine that in one generation $g$, position $p$ of sequence $S1$ undergoes a substitution mutation from A to C.
  • Then, in the next generation $g + 1$, the same position $p$ of sequence $S1$ undergoes a substitution from C to T.
  • Because we can only inspect the modern-day sequences, not their ancestors, it looks like position $p$ has had a single substitution event.
  • Similarly, if in generation $g + 1$ position $p$ changed from C back to A (a back substitution), we would observe zero substitution mutations at that position even though two had occurred.

To correct for this, the Jukes-Cantor correction is typically applied to the Hamming distances between the sequences.

  • Where $p$ is the Hamming distance, the corrected genetic distance is computed as $d = -\frac{3}{4} \ln(1 - \frac{4}{3}p)$.
  • The derivation of this formula is beyond the scope of this text (you can find it in Inferring Phylogeny by Felsenstein), but it is based on the Jukes-Cantor (JC69) nucleotide substitution model.

https://en.wikipedia.org/wiki/Substitution_model

The Python implementation of this correction looks like the following. We can apply this to a number of input distance values to understand how it transforms our Hamming distances.

In [29]:
def jc_correction(p):
    return (-3 / 4) * np.log(1 - (4 * p / 3))
In [30]:
distances = np.arange(0, 0.70, 0.05)

jc_corrected_distances = list(map(jc_correction, distances))

ax = sns.pointplot(distances, jc_corrected_distances)

ax.set_xlabel('Hamming distance')
ax.set_ylabel('JC-corrected distance')
ax.set_xlim(0)
ax.set_ylim(0)

ax
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f25243e3be0>

We can then apply this to a full distance matrix as follows (we'll then print the first row of each).

In [31]:
def jc_correct_dm(dm):
    result = np.zeros(dm.shape)
    for i in range(dm.shape[0]):
        for j in range(i):
            result[i, j] = result[j, i] = jc_correction(dm[i, j])
            
    return skbio.DistanceMatrix(result, ids=dm.ids)

jc_corrected_hamming_dm = jc_correct_dm(hamming_dm)
In [32]:
print(hamming_dm[0])
print(jc_corrected_hamming_dm[0])
[ 0.    0.38  0.3   0.38  0.42  0.46  0.26  0.36  0.3   0.32  0.42  0.38
  0.3   0.32  0.44  0.38  0.34  0.34  0.52  0.42  0.36  0.42  0.3   0.26
  0.38]
[ 0.          0.52992765  0.38311922  0.52992765  0.61573541  0.71264421
  0.31925086  0.49044485  0.38311922  0.417216    0.61573541  0.52992765
  0.38311922  0.417216    0.66262568  0.52992765  0.45293704  0.45293704
  0.88649542  0.61573541  0.49044485  0.61573541  0.38311922  0.31925086
  0.52992765]
In [33]:
# Pre-correction:
_ = hamming_dm.plot(cmap='Greens', title='Hamming distances between sequences')

# Post-correction:
_ = jc_corrected_hamming_dm.plot(cmap='Greens', title='JC-corrected Hamming distances between sequences')

2.4.6.5 Phylogenetic reconstruction with UPGMA [edit]

The first algorithm we'll look at for reconstructing phylogenetic trees is called UPGMA, which stands for Unweighted Pair-Group Method with Arithmetic mean. While that name sounds complex, it's actually a straightforward algorithm, which is why we're starting with it. After we work through the algorithm, we'll come back to the name as it'll make more sense then.

UPGMA is a generic hierarchical clustering algorithm. It's not specific to reconstructing biological trees, but rather is used for interpreting any type of distance matrix. It is fairly widely used for building phylogenetic trees, though it's application in phylogenetics is usually restricted to building preliminary trees to "guide" the process of multiple sequence alignment. The reason for this is that it's fast, but it makes some assumptions that don't work well for inferring relationships between organisms, which we'll discuss after working through the algorithm.

UPGMA starts with a distance matrix, and works through the following steps to create a tree.

Step 1: Find the smallest non-zero distance in the matrix and define a clade containing only those members. Draw that clade, and set the total length of the branch connecting the tips to the distance between the tips. The distance between each tip and the node connecting them should be half of the distance between the tips.

Step 2: Create a new distance matrix with an entry representing the new clade created in step 1.

Step 3: Calculate the distance matrix entries for the new clade as the mean distance from each of the tips of the new clade to all other tips in the original distance matrix.

Step 4: If there is only one distance (below or above the diagonal) in the distance matrix, use it to connect the remaining unconnected clades, and stop. Otherwise repeat step 1.

Let's work through these steps for a simple distance matrix representing the distances between five sequences.

In [34]:
_data = np.array([[ 0.,  4.,  2.,  5.,  6.],
                  [ 4.,  0.,  3.,  6.,  5.],
                  [ 2.,  3.,  0.,  3.,  4.],
                  [ 5.,  6.,  3.,  0.,  1.],
                  [ 6.,  5.,  4.,  1.,  0.]])

_ids = ['s1', 's2', 's3', 's4', 's5']

master_upgma_dm = skbio.DistanceMatrix(_data, _ids)

print(master_upgma_dm)
5x5 distance matrix
IDs:
's1', 's2', 's3', 's4', 's5'
Data:
[[ 0.  4.  2.  5.  6.]
 [ 4.  0.  3.  6.  5.]
 [ 2.  3.  0.  3.  4.]
 [ 5.  6.  3.  0.  1.]
 [ 6.  5.  4.  1.  0.]]

Iteration 1

Step 1: The smallest non-zero distance in the above matrix is between s4 and s5. So, we'll draw that clade and set each branch length to half of the distance between them.

Step 2: Next, we'll create a new, smaller distance matrix where the sequences s4 and s5 are now represented by a single clade which we'll call (s4, s5). This notation indicates that the corresponding distances are to both s4 and s5.

In [35]:
iter1_ids = ['s1', 's2', 's3', '(s4, s5)']
iter1_dm = [[0.0,   4.0,  2.0, None],
            [4.0,   0.0,  3.0, None],
            [2.0,   3.0,  0.0, None],
            [None, None, None, None]]

Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). The distance will be the mean between each pre-existing clade, and each of the sequences in the new clade. For example, the distance between s1 and (s4, s5) is the mean of the distance between s1 and s4 and s1 and s5:

In [36]:
import numpy as np

# Note: we are not using the iter1_dm, but the original matrix (master_upgma_dm)
s1_s4s5 = np.mean([master_upgma_dm['s1', 's4'], master_upgma_dm['s1', 's5']])
print(s1_s4s5)
5.5

Similarly, the distance between s2 and (s4, s5) is the mean of the distance between s2 and s4 and s2 and s5:

In [37]:
s2_s4s5 = np.mean([master_upgma_dm['s2', 's4'], master_upgma_dm['s2', 's5']])
print(s2_s4s5)
5.5

And finally, the distance between s3 and (s4, s5) is the mean of the distance between s3 and s4 and the distance between s3 and s5:

In [38]:
s3_s4s5 = np.mean([master_upgma_dm['s3', 's4'], master_upgma_dm['s3', 's5']])
print(s3_s4s5)
3.5

We can fill these values in to our iteration 1 distance matrix. Why do we only need to compute three values to fill in seven cells in this distance matrix?

In [39]:
iter1_dm = [[0.0, 4.0, 2.0, s1_s4s5],
      [4.0, 0.0, 3.0, s2_s4s5],
      [2.0, 3.0, 0.0, s3_s4s5],
      [s1_s4s5, s2_s4s5, s3_s4s5, 0.0]]

iter1_dm = DistanceMatrix(iter1_dm, iter1_ids)
print(iter1_dm)
4x4 distance matrix
IDs:
's1', 's2', 's3', '(s4, s5)'
Data:
[[ 0.   4.   2.   5.5]
 [ 4.   0.   3.   5.5]
 [ 2.   3.   0.   3.5]
 [ 5.5  5.5  3.5  0. ]]

Step 4: Because there is still more than one value below the diagonal in our new distance matrix, we start a new iteration by going back to Step 1 and repeating this process.

Iteration 2

Step 1: The smallest non-zero distance in the iteration 1 distance matrix is between s1 and s3. So, we'll draw that clade and set each branch length to half of that distance.

Step 2: We next create a new, smaller distance matrix where the sequences s1 and s3 are now represented by a single clade, (s1, s3).

In [40]:
iter2_ids = ['(s1, s3)', 's2', '(s4, s5)']
iter2_dm = [[None, None, None],
      [None,  0.0, 5.5],
      [None,  5.5, 0.0]]

Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). Notice that the distance between our new clade and s2 is the mean of two values, but the distance between our new clade and the clade defined in iteration 1 is the mean of four values. Why is that?

In [41]:
# For combining on s2 (index 1)
s2_s1s3 = np.mean([master_upgma_dm[1][0], master_upgma_dm[1][2]])

# For combining on s4 and s5 (referencing original master index)
s3s4_s1s3 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4], master_upgma_dm[2][3], master_upgma_dm[2][4]])

We can now fill in all of the distances in our iteration 2 distance matrix.

In [42]:
iter2_dm = [[0.0, s2_s1s3, s3s4_s1s3],
            [s2_s1s3, 0.0, 5.5],
            [s3s4_s1s3, 5.5, 0.0]]

iter2_dm = DistanceMatrix(iter2_dm, iter2_ids)
print(iter2_dm)
3x3 distance matrix
IDs:
'(s1, s3)', 's2', '(s4, s5)'
Data:
[[ 0.   3.5  4.5]
 [ 3.5  0.   5.5]
 [ 4.5  5.5  0. ]]

Step 4: There is still more than one value below the diagonal, so we start a new iteration by again repeating the process.

Iteration 3

Step 1: The smallest non-zero distance in the above matrix is now between (s1, s3) and s2. So, we'll draw that clade and set each branch length to half of the distance.

Step 2: We'll next create a new distance matrix where the clade (s1, s3) and the sequence s2 are now represented by a single clade, ((s1, s3), s2).

In [43]:
iter3_ids = ['((s1, s3), s2)', '(s4, s5)']
iter3_dm = [[None, None],
            [None,  0.0]]

Step 3: We'll now fill in the values from the new clade to each of the existing sequences (or clades). This is now the mean of six distances. Why?

In [44]:
s1s2s3_s4s5 = np.mean([master_upgma_dm[0][3], master_upgma_dm[0][4],
                       master_upgma_dm[2][3], master_upgma_dm[2][4],
                       master_upgma_dm[1][3], master_upgma_dm[1][4]])

We fill this value into our iteration 3 distance matrix.

In [45]:
iter3_dm = [[0.0, s1s2s3_s4s5],
            [s1s2s3_s4s5, 0.0]]

iter3_dm = DistanceMatrix(iter3_dm, iter3_ids)
print(iter3_dm)
2x2 distance matrix
IDs:
'((s1, s3), s2)', '(s4, s5)'
Data:
[[ 0.          4.83333333]
 [ 4.83333333  0.        ]]

Step 4: At this stage, there is only one distance below the diagonal in our distance matrix. So, we can use that distance to draw the final branch. This will connect our two deepest clades, ((s1, s3), s2) and (s4, s5), which will give us our final UPGMA tree.

2.4.6.5.1 Applying UPGMA from SciPy [edit]

SciPy contains an implementation of UPGMA that we can apply to our existing distance matrices, and we can then visualize the resulting trees with ete3. IAB provides a wrapper function that will give this an interface that is convenient to work with. If you'd like to see what the wrapper function is doing, using the psource IPython magic function as we have in other places in the text.

In [46]:
from iab.algorithms import tree_from_distance_matrix, guide_tree_from_sequences
%psource tree_from_distance_matrix
#%psource guide_tree_from_sequences
In [47]:
from scipy.cluster.hierarchy import average
help(average)
Help on function average in module scipy.cluster.hierarchy:

average(y)
    Performs average/UPGMA linkage on a condensed distance matrix
    
    Parameters
    ----------
    y : ndarray
        The upper triangular of the distance matrix. The result of
        ``pdist`` is returned in this form.
    
    Returns
    -------
    Z : ndarray
        A linkage matrix containing the hierarchical clustering. See
        the ``linkage`` function documentation for more information
        on its structure.
    
    See Also
    --------
    linkage: for advanced creation of hierarchical clusterings.

Let's compute and visualize UPGMA trees for the two distance matrices that we created above. How do these trees compare to one another? Does one look more or less correct than the other (they may or may not, depending on the random sample of sequences that are being compared).

One thing to be aware of as you start visualizing trees is that the vertical order (in the default TreeStyle being used here) doesn't have biological meaning, it's purely a visualization component. You can rotate the branches descending from any node in the tree freely.

In [48]:
kmer_tree = tree_from_distance_matrix(kmer_dm, 
                                      metric='upgma')

print(type(kmer_tree))

ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
<class 'skbio.tree._tree.TreeNode'>
Out[48]:
In [49]:
jc_corrected_hamming_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm), 
                                                      metric='upgma')

ete3.Tree(str(jc_corrected_hamming_tree), format=1).render("%%inline", tree_style=ts)
Out[49]:

Which is better?

Which produces a higher percentage of agreement between leaf pairs?

How could we compute that?

2.4.6.5.2 Understanding the name [edit]

As mentioned above, UPGMA has a rather complex sounding name: Unweighted Pair Group metric with Arithmetic mean.

  • The Unweighted term indicates that all tip-to-tip distances contribute equally to each average that is computed (no weighted averages are being computed).
  • The Pair Group term implies that all internal nodes, including the root node, will be strictly bifurcating, or descent to exactly two other nodes (either internal or terminal).
  • The Arithmetic mean term implies that distances to each clade are the mean of distances to all members of that clade.

2.4.6.6 Phylogenetic reconstruction with neighbor-joining [edit]

One invalid assumption that is made by UPGMA is inherent in Step 1, where each branch connecting the internal node to a tip is set to half of the length between the tips.

  • This assumes the mutation rates are constant throughout the tree, or in other words that the tree is ultrametric.
  • This is not likely to be the case in the real world, as different lineages in the tree might be undergoing different selective pressures, leading to different rates of evolution.

Neighboring joining is a distance-based phylogenetic reconstruction approach that does not assume ultrametricity.

image.png

In [50]:
nj_tree = tree_from_distance_matrix(jc_correct_dm(hamming_dm), 
                                    metric='nj')

ete3.Tree(str(nj_tree), format=1).render("%%inline", tree_style=ts)
Out[50]:

2.4.10 Rooted versus unrooted trees [edit]

The following is a rooted tree, which means that it includes an assumption about the last common ancestor of all sequences represented in the tree.

https://en.wikipedia.org/wiki/Phylogenetic_tree#Rooted_tree

https://en.wikipedia.org/wiki/Tree_(graph_theory)#Rooted_tree

A rooted phylogenetic tree (see two graphics at top) is a directed tree with a unique node, the root, corresponding to the (usually imputed) most recent common ancestor of all the entities at the leaves of the tree.

  • The root node does not have a parent node, but serves as the parent of all other nodes in the tree.
  • The root is therefore a node of degree 2 while other internal nodes have a minimum degree of 3 (where "degree" here refers to the total number of incoming and outgoing edges).

The most common method for rooting trees is the use of an uncontroversial outgroup close enough to allow inference from trait data or molecular sequencing, but far enough to be a clear outgroup.

An unrooted tree, like the following, doesn't include an assumption about the last common ancestor of all sequences:

  • Unrooted trees illustrate the relatedness of the leaf nodes without making assumptions about ancestry. T
  • They do not require the ancestral root to be known or inferred.
  • Unrooted trees can always be generated from rooted ones by simply omitting the root.

By contrast, inferring the root of an unrooted tree requires some means of identifying ancestry. This is normally done by including an outgroup in the input data so that the root is necessarily between the outgroup and the rest of the taxa in the tree, or by introducing additional assumptions about the relative rates of evolution on each branch, such as an application of the molecular clock hypothesis.

2.4.11 What about horizontal gene transfer? [edit]

  • Horizontal gene transfer (HGT) or lateral gene transfer (LGT)is the movement of genetic material between unicellular and/or multicellular organisms other than by the ("vertical") transmission of DNA from parent to offspring.
  • Horizontal gene transfer is the primary mechanism for the spread of antibiotic resistance in bacteria, plays an important role in the evolution of bacteria that can degrade novel compounds such as human-created pesticides and in the evolution, maintenance, and transmission of virulence.
  • Most thinking in genetics has focused upon vertical transfer, but horizontal gene transfer is important, and among single-celled organisms is perhaps the dominant form of genetic transfer.

image.png

image.png

Tree of life showing vertical and horizontal gene transfers:

image.png

image.png

A phylogenetic network or reticulation is any graph used to visualize evolutionary relationships (either abstractly or explicitly) between nucleotide sequences, genes, chromosomes, genomes, or species.

  • They are employed when reticulation events such as hybridization, horizontal gene transfer, recombination, or gene duplication and loss are believed to be involved.
  • They differ from phylogenetic trees by the explicit modeling of richly linked networks, by means of the addition of hybrid nodes (nodes with two parents) instead of only tree nodes (a hierarchy of nodes, each with only one parent).
  • Phylogenetic trees are a subset of phylogenetic networks.

image.png

Phylogenetic network of HVS-I variation in mitochondrial DNA haplogroup R0 and R0b

Mini lab

Improving simulation assumptions

Q1

  • Does our ability to re-construct improve with sequence length? Show a couple tests to see, using hamming/kmer and/or upgma/nj. Note: you don't need a totally computational metric of quality, but at least couple plots where you manually assess quality for different starting parent sequence lengths.
    • Does it matter the method?
    • If it changes, is the function linear, increasing, decreasing?
In [51]:
# Do coding here, 
# snap screenshots of code/plot, 
# and write a paragraph in Canvas, 
# with screenshots embedded in the text entry box

# little helper function
def mean_distance(D):
    sum = 0
    for i in range(len(D)):
        for j in range(len(D)):
            if i != j:
                sum += D[i][j]
    return sum/(len(D)*(len(D)-1))

sequences = []
sequences_aligned = []

kmer_av_distances = []
hamming_av_distances = []

import pandas as pd
print(hamming_av_distances)
print(kmer_av_distances)

# entries go up by 50 sequences
def mean_line_graph(hamming_dists, kmer_dists):
    cols = []
    for i in range(len(hamming_dists)):
        val = (i+1)*50
        cols.append(str('%d seqs'%(val)))

    df = pd.DataFrame({
                        'hamming': hamming_dists,
                        'kmer': kmer_dists
                    },index=cols)
    df.plot.line()

def create_evolutions(seq_count, indel_prob):
    # create a data set
    sequence = random_sequence(skbio.DNA, seq_count)
    sequences = evolve_generations(sequence, generations=3, 
                                   substitution_probability=0.1, 
                                   indel_probability=indel_prob,
                                   increased_rate_probability=0.1, 
                                   verbose=False)

    return sequences

def create_kmer_dm(sequences, distances, length=50):
    kmer_dm = DistanceMatrix.from_iterable(sequences, 
                                           metric=kmer_distance, 
                                           key='id')
    _title = '3mer distances between sequences length %d' %(length)
    _ = kmer_dm.plot(cmap='Greens', 
                     title=_title)

    df = kmer_dm.to_data_frame()
    df = df.values.tolist()
    distances.append(mean_distance(df))
    return kmer_dm

def create_hamming_dm(sequences, distances, length=50):
    if indel_probability == 0:
        sequences_aligned = sequences
    else:
        gpa = partial(global_pairwise_align_nucleotide, 
                      penalize_terminal_gaps=True)

        sequences_aligned = progressive_msa(sequences, 
                                            pairwise_aligner=gpa)

    hamming_dm = DistanceMatrix.from_iterable(sequences_aligned, 
                                               metric=hamming, 
                                               key='id')
    _title='Hamming distances between sequences length %d'%(length)
    _ = hamming_dm.plot(cmap='Greens', 
                     title=_title)
    
    df = hamming_dm.to_data_frame()
    df = df.values.tolist()
    hamming_av_distances.append(mean_distance(df))
    return hamming_dm

def create_kmer_tree(kmer_dm):
    kmer_tree = tree_from_distance_matrix(jc_correct_dm(kmer_dm), 
                                      metric='nj')

    ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)

def create_hamming_tree(kmer_dm):
    kmer_tree = tree_from_distance_matrix(jc_correct_dm(kmer_dm), 
                                      metric='nj')

    ete3.Tree(str(kmer_tree), format=1).render("%%inline", tree_style=ts)
    
def create_both_trees(kmer,hamming):
    create_kmer_tree(kmer)
    create_hamming_tree(hamming)
    
def run_all_graphs(iter_counts, indel_prob, kmer_av, hamming_av):
    for seq_count in iter_counts:
        seqs = create_evolutions(seq_count, indel_prob)
        hdm  = create_hamming_dm(seqs, hamming_av, seq_count)
        kdm  = create_kmer_dm(seqs, kmer_av, seq_count)
        create_both_trees(kdm,hdm)
    
[]
[]
In [52]:
hamming_av_distances =  []
kmer_av_distances = []
seqs = []
for i in range(8):
    seqs.append((i+1)*50)
run_all_graphs(seqs, 0, kmer_av_distances, hamming_av_distances)
mean_line_graph(hamming_av_distances, kmer_av_distances)

Q2

  • Does our ability to re-construct improve with mutation rates that are uneven across the DNA segment, e.g., some regions being hyper-variable, and others being less so? No plots are needed for this one, just arm-chair thinking. Write a pagagraph in Canvas.